Abstract
Motivation: In smallsample settings, bolstered error estimation has been shown to perform better than crossvalidation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a nonparametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for highdimensional feature spaces.
Results: This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in realworld applications where the model is not known, but the other factors in determining the optimal kernel are known.
Availability: Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering
Contact:edward@mail.ece.tamu.edu
1 INTRODUCTION
Throughout most of the history of pattern recognition, the number of features was much smaller than the numbers currently being generated in highthroughput biology. Less than 15 years ago, in two studies on feature selection most cases considered involved <30 features and the maximum number considered was 65 (Jain and Zongker, 1997; Kudo and Sklansky, 2000). The advent of highthroughput technologies has radically altered the landscape. In conjunction with large numbers of features, bioinformatics is confronted by small sample sizes, often <100, which forces one to train and test on the same data, where bias, variance (BragaNeto and Dougherty, 2004b) and lack of correlation with the true error (Hanczar et al., 2007, 2010) can severely degrade error estimation. Performance can degrade even further in the presence of feature selection (Molinaro et al., 2005). Recent articles have pointed out the difficulty in establishing performance advantages for proposed classification rules (Boulesteix, 2010; Jelizarow et al., 2010; Rocke et al., 2009). Two statistically grounded sources of overoptimism have been highlighted: (i) applying a classification rule to numerous datasets and then reporting only the results on the dataset for which the designed classifier possesses the lowest estimated error (Yousefi et al., 2010); and (ii) applying multiple classification rules to a dataset and comparing the classification rules according to the estimated errors of the designed classifiers (Boulesteix and Strobl, 2009). In both cases, optimism is a result of inaccurate error estimation.
A good error estimator ideally would have small bias and small variance. This is a difficult tradeoff in smallsample settings. In smallsample cases, resubstitution generally has small variance but tends to be quite optimistically biased. Crossvalidation has small bias, but tends to display high variance. Bolstered error estimation (BragaNeto and Dougherty, 2004a) attempts to achieve a compromise to this biasvariance dilemma in smallsample settings. It is based on the idea of modifying (‘bolstering’) the empirical distribution of the data by placing kernels at each data point and then estimating classifier error by the error on this bolstered empirical distribution in such a way that it reduces bias, while at the same time reducing variance. Bolstered error estimation has shown good performance when compared with popular error estimators in smallsample settings, in particular, for featureset ranking and when used internally within a featureselection algorithm (Sima et al., 2005a) and for ranking feature sets (Sima et al., 2005b). Its good performance, including the latter applications, has been demonstrated in the context a small number of features, including feature selection via sequential forward selection (SFS), where it is applied to small potential feature sets in the SFS algorithm. A critical aspect of the method is selecting the right amount of bolstering, which is given by the variance of the bolstering kernels. The original bolstering paper (BragaNeto and Dougherty, 2004a) proposed a nonparametric estimator for the kernel variance, which was found empirically to perform well in lowdimensional spaces; however, estimation was found to degrade in high dimensions, so that a correction factor can be required (Vu and BragaNeto, 2008). In fact, it was demonstrated in a preliminary study that a correction factor can also be beneficial for lowdimensional bolstering (Huynh et al., 2007).
This leads us to consider optimal bolstering, specifically, finding an optimal variance for the bolstering kernels. Error estimators like resubstitution and crossvalidation (assuming the number of folds is preset) are nonparametric. They contain no free parameters. This is not the case for bootstrap. In general, bootstrap has the form of a convex error estimator, namely,
where and are the resubstitution and zerobootstrap estimators and 0≤a≤1. The zerobootstrap utilizes the empirical distribution F^{*}, which puts mass on each of the n available data points. A bootstrap sample S_{n}^{*} from F^{*} consists of n equallylikely draws with replacement from the original data S_{n}. The basic bootstrap zero estimator (Efron, 1983) is written in terms of the empirical distribution asIn practice, the expectation E_{F*} has to be approximated by a MonteCarlo estimate based on independent replicates S_{n}^{*b}, for b=1,…, B, in which case the classifier is designed on the bootstrap sample and tested on the original data points left out. An optimal bootstrap estimator results from a value of a that minimizes the meansquare error between and the true error for a given featurelabel distribution (Sima and Dougherty, 2006b). Setting a=0.632, as is commonly done (Efron, 1983), can lead to a far from optimal estimator (optimal weights).
The present article considers optimal bolstering relative to its one free parameter, kernel variance and the manner in which optimal bolstering can be used to arrive at practical implementation of bolstering in highdimensional feature space. The end product is an implementation protocol in which optimal kernel variances across different models are combined to produce a suitable kernel variance for the problem at hand. Throughout, we will assume feature selection because that would be the standard way to approach classification in the highdimensional setting we are considering, although this is not a mandatory requirement of the approach.
2 SYSTEMS AND METHODS
This section will be broken into subsections, with the aim of arriving at the implementation protocol for realworld data. Section 2.1 briefly reviews the necessary essentials of error estimation, mainly bolstering. Section 2.2 defines the scaling factor by which to adjust the bolstering kernel to high dimensions. Section 2.3 discusses optimization of the scaling factor and illustrates the construction of a set of optimal scaling factors across a family of models varying in both structure and classification difficulty. Section 2.4 provides the implementation of highdimensional bolstered resubstitution based on a family of optimal scaling factors.
2.1 Error estimation
In twogroup statistical pattern recognition, there is a feature vector X∈ℝ^{p} and a label Y∈{0, 1}. The pair (X,Y) has a joint probability distribution F, which is unknown in practice. Hence, a classifier is designed from training data, which is a set of n independent observations, S_{n}={(X_{1}, Y_{1}),…, (X_{n}, Y_{n})}, drawn from F. A classification rule is a mapping Ψ_{n} : {ℝ^{p}×{0, 1}}^{n}×ℝ^{p}→, where is the set of mappings from ℝ^{p} into {0, 1}. It maps S_{n} into a classifier ψ_{n} : ℝ^{p}→{0, 1}. The classification error ε_{n} is the probability of an erroneous classification:
where E_{F} denotes expectation with respect to F. Were F known, then the error could be found via Equation (3). In practice, one must use an error estimator . An error estimator can suffer from bias, Bias , and deviation variance, Var_{dev} = Var. These combine to contribute to the most common measure (used herein) for evaluating the accuracy of an error estimator, the rootmeansquare (RMS):2.1.1 Classical error estimation
The simplest way to estimate the error in the absence of independent test data is to compute its error directly on the sample data itself. This resubstitution estimator, , is usually optimistic (i.e. biased low), sometimes very much so.
In kfold crossvalidation, the dataset S_{n} is partitioned into k folds S_{n}^{(i)}, for i=1,…, k (for simplicity, we assume that k divides n). Each fold is left out of the design process and used as a test set, and the estimate, , is the overall proportion of error on all folds. A kfold crossvalidation estimator is unbiased as an estimator of ε_{n−n/k}. Crossvalidation estimators are pessimistic, since they use smaller training sets to design the classifier; however, their bias tends to be small. Their main drawback is their large variance (BragaNeto and Dougherty, 2004b; Devroye et al., 1996). Sometimes crossvalidation is repeated some number of times with different fold partitions and the results averaged. In this article, we use 10fold crossvalidation without repetition.
A recently developed estimation method, called adjusted bootstrap (), which carries out further bootstrap resampling in each fold, has been found to have good RMS performances (Jiang and Simon, 2007). Specifically, S_{n} is partitioned into n folds and, for each sample left out for testing, B bootstrap sample sets of size ln are drawn from the remaining n−1 points, l=1, 2,…, L. For each l, the error e_{l} is the proportion of misclassified samples across n folds and B bootstrap sample sets. Finally, the adjusted bootstrap error is computed in the form
where and are least squares estimates for the function and u_{l} is the proportion of the expected number of nonrepeated samples in a size ln bootstrap sample set.2.1.2 Bolstered error estimation
The empirical featurelabel distribution F^{*} is a discrete distribution that puts mass on each of the n available data points. The resubstitution estimator can be written in terms of the empirical featurelabel distribution as
Relative to F^{*}, no distinction is made between points near or far from the decision boundary. If one spreads the probability mass of the empirical distribution at each point, then variation is reduced because points near the decision boundary will have more mass on the other side of the boundary than will points far from the decision boundary. Consider a probability density function f_{i}^{⋄}, for i=1,…, n, called a bolstering kernel, and define the bolstered empirical distribution F^{⋄}, with probability density function given by
The bolstered resubstitution estimator (BragaNeto and Dougherty, 2004a) is obtained by replacing F^{*} by F^{⋄} in Equation (5) to obtain
Bolstering can be applied to other error estimators; however, we only use bolstered resubstitution, the bolstering method used the most to date.
The bolstered resubstitution estimator is given by
where A_{j}={x∣ψ(x)=j}. The integrals are the error contributions made by the data points, according to whether y_{i}=0 or y_{i}=1. If the classifier is linear, then the decision boundary is a hyperplane and it is usually possible to find analytical expressions for the integrals; otherwise, MonteCarlo integration can be employed.The amount of bolstering determines the variance and bias properties (hence, RMS also) of the bolstered estimator. As a general rule, wider bolstering kernels lead to lower variance estimators, but after a certain point this advantage becomes offset by increasing pessimistic bias. In the other direction, insufficiently wide kernels tend to result in optimistic bias. A zeromean, spherical Gaussian bolstering kernel f_{i}^{⋄} with covariance matrix of the form κ_{i}^{2}I, where I is the identity matrix, has been proposed (BragaNeto and Dougherty, 2004a), and has been shown to work well in lowdimensional feature spaces. Since bolstered estimators spread the test points, the task is to find the amount of spreading that makes the test points to be as close as possible to the true mean distance to the training data points. The true mean distance can be estimated by its samplebased estimate:
The estimate is the mean minimum distance between points belonging to class y. Next, let f_{i}^{⋄,1} be a unit−variance bolstering kernel, R_{i} be the random variable equal to the distance of a point randomly selected from f_{i}^{⋄,1} to the origin and F_{Ri}(x) be the cumulative distribution function of R_{i}. In the case of the bolstering kernel f_{i}^{⋄} with covariance matrix κ_{i}^{2}I, all distances get multiplied by κ_{i}. In BragaNeto and Dougherty (2004a), a single variance κ_{y}^{2} is estimated for all points from class y, such that the median distance of a test point to the origin is equal to the estimated true mean distance . This implies that half of the ‘mass’ (i.e. the ‘test points’) of the bolstering kernel will be farther from the center than and the other half will be nearer. Hence, κ_{y} is the solution of the equation . Letting α_{p}=F_{Ri}^{−1}(1/2), and recognizing that the R_{i} are identically distributed, the estimated SDs for the bolstering kernels are given by
for i=1, 2,…, n.2.2 Highdimensional bolstered resubstitution
In highdimensional settings, it is commonplace to perform feature selection and, when performed, feature selection is part of the classification rule, with the entire set of potential features constituting the feature set relative to the classification rule. Feature selection constrains the space of functions from which a classifier might be chosen, but it does not reduce the number of features in the design process. This is why when using crossvalidation error estimation, feature selection has to be carried out in each partitioned fold.
If we perform feature selection on a Ddimensional dataset S_{n}^{D} and arrive at a ddimensional set S_{n}^{d} (d < D), then the bolstered error estimator can use the previously defined kernel size κ_{i}, computed on S_{n}^{D}, not S_{n}^{d}. Specifically, the mean minimum distance is estimated on S_{n}^{D} and α_{p}=α_{D}. For high dimensions, we replace κ_{i} by
where k_{D} is an additional scaling factor determined by the dimension and where we have indicated the dimension in the mean minimum distance estimate. The idea is to adjust the kernel size by choosing k_{D} so the bolstered error estimator will be optimal (minimum RMS). k_{D}=1 yields the previously proposed kernel variance. In essence, κ_{i} is a parameter for the bolstered estimator and Equation (11) sets it free, thereby allowing for optimization. The situation is akin to 0.632 bootstrap as opposed to optimal bootstrap.Given the kernel sizes, the bolstered resubstitution error estimate is given by Equation (8) in D dimensions. For Gaussian kernels with independent variables, this integral reduces. Let f_{i}^{⋄,d}(x−x_{i}) and f_{i}^{⋄,D−d}(x−x_{i}) denote the Gaussian kernels in d and (D−d)dimensional spaces, respectively, so that the D dimensional Gaussian kernel decomposes as
Denoting x−x_{i} as Δx_{i}, then Equation (8) can be rewritten as where A_{j}^{d},j=0, 1, is the projection of the classifier decision region A_{j} into ddimensional space, and we added a superscript ‘D’ to the bolstered error estimator to indicate it refers to the error in Ddimensional space. The previous result indicates that the integrals necessary to find the bolstered error estimate in Ddimensional space can be equivalently carried out in d dimensional space. This is akin to resubstitution, where the error count is the same whether it is done in D or ddimensional space. For performance comparison purposes, we will also estimate the kernel size using only the lowdimensional data S_{n}^{d}, resulting in a bolstered error estimator , which uses the originally proposed kernel variance (no correction, or k_{D}=1). For feature selection, we will use sequential forward floating search (SFFS) (Pudil et al., 1994).2.3 Optimization method
To find the optimal kernel scaling factor k_{D}, we utilize the following procedure:
Protocol 1

Generate a sample set S_{n}^{D} of size n and a total of D features from a specified synthetic model.

Select a sized feature set A using a featureselection method F on S_{n}^{D}, resulting in a reduced dimension sample set S_{n}^{d} for the feature set A.

Design a classifier ψ_{n} for S_{n}^{d} according to the given classification rule Ψ_{n}.

Compute the true error ε_{n} using the underlying distribution of the model.

Compute the 10fold crossvalidation error (keeping in mind that feature selection must be repeated for each fold).

Compute the bolstered error .

Compute the bolstered errors for a list of kernel scaling factors k_{D,1}, k_{D,2},….

Calculate RMS for each error estimator by repeating Steps 1 through 7 a number N of times.

Repeat Steps 1 through 8 for different models M, different levels of model complexities and different classification rules Ψ_{n}.
We consider four data models, each a twoclass Gaussian model with equally likely classes and classconditional densities having covariance matrices Σ_{1} and Σ_{2}. One class mean is located at and the other at , with the location of depending on the model. The parameter δ is chosen to achieve prescribed values for the expected classification error E[ε_{n}]; different values of E[ε_{n}] represent different levels of difficulty at sample size n.
M1: A simple linear model in which Σ_{1}=Σ_{2}=I, the identity matrix, so that all features are uncorrelated. a_{i}=1 for i=1, 2,… d_{0} and uniformly distributed over [0, 1] for i=d_{0}+1, d_{0}+2,… D before all a_{i}'s are randomly permuted.
M2: Similar to M1 but with Σ_{1}=I and Σ_{2}=cI, where c is a constant and c ≠ 1. The Bayesian decision boundary is quadratic.
M3: This is a Block Covariance Model where all features are equally divided into G groups. The features from different groups are uncorrelated and the features from the same group possess the same correlation, ρ, among each other. The structure of the covariance matrix is
where Σ_{ρ} has 1 on the diagonal and ρ off the diagonal. Here a_{i}=1 for i=1, 2,… D.M4: Similar to M3, but with Σ_{1}=Σ_{G} and Σ_{2}=cΣ_{G}.
Table 1 gives a summary of the simulation experiments. Two limiting factors should be noted. First, the maximum total number of features, 500, is smaller than those often considered in microarray studies and, second, the number of selected features is kept to 5 or 10. There are three reasons for this, one pragmatic to our set of simulations and the others having to do with the nature of feature selection. The pragmatic reason is computational: we wish to do a large simulation study and therefore want to limit the computational burden. As for feature selection, given the sample sizes, it is prudent to keep the numbers of total and selected features small to have satisfactory feature selection (Sima and Dougherty, 2006a) and the number of selected features small to avoid the peaking phenomena (Hua et al., 2005, 2009). Regarding the total number of features, limiting the total number of features via prior biological knowledge or requirements on data quality raises the likelihood of finding good feature sets via feature selection (Zhao et al., 2010). Regarding the efficacy of selecting small feature sets, studies have shown that good classification can be achieved with two or three genes when reexamining data from studies that had originally used much larger feature sets (BragaNeto, 2007; Grate, 2005).
Data models  M1, M2, M3, M4  
Model difficulty  E[ε_{n}]  0.05, 0.10, 0.15 
Classification rules  Ψ_{n}  LDA,3NN, LSVM 
NNet,CART  
Featureselection methods  SFFS  
No. of repetitions  N  500 
No. of sample size  n  50, 100, 150 
No. of selected features  d  5, 10 
No. of total features  D  200, 500 
Kernel scaling factors  k _{ D }  0.2 to 2.0 in 0.2 increment 


d _{0}=10, G=20, c=2.25, ρ=0.25 
Data models  M1, M2, M3, M4  
Model difficulty  E[ε_{n}]  0.05, 0.10, 0.15 
Classification rules  Ψ_{n}  LDA,3NN, LSVM 
NNet,CART  
Featureselection methods  SFFS  
No. of repetitions  N  500 
No. of sample size  n  50, 100, 150 
No. of selected features  d  5, 10 
No. of total features  D  200, 500 
Kernel scaling factors  k _{ D }  0.2 to 2.0 in 0.2 increment 


d _{0}=10, G=20, c=2.25, ρ=0.25 
LDA, linear discriminant analysis; 3NN, 3nearestneighbor; LSVM, linear support vector machine; NNet, neural net; CART, classification and regression tree.
We plot the RMS versus kernel scaling factor k_{D} for , using all combinations of simulation parameters displayed in Table 1. Additionally, we compute the RMS for LDA with D=200, d=3 and n=50 for E[ε_{n}] =0.20, 0.25, 0.30, 0.35, 0.40 and 0.45. For comparison, RMS values for , and are also plotted, which appear as horizontal lines as they are not related to k_{D}. Here, we present some typical results, the complete set of plots appearing on the companion website. Note that due to the intensive computing in we only compute it for LDA with D=200, d=3 and n=50.
Figure 1 shows the result for LDA, n=50, and selecting d=3 out of D=200 features for nine values of E[ε_{n}]. Letting k_{D}^{min} denote the value of k_{D} achieving minimum RMS, we see that k_{D}^{min} increases for increasing expected error, the increase being slight for small expected errors but becoming significant for large expected errors (for E[ε_{n}] = 0.40 and 0.45, k_{D}^{min} is to the right of where we have stopped the plots at k_{D}=2.0; see extended plots to k_{D}=3.0 for these cases on the companion website). We observe that and typically perform better than , sometimes by a large margin. However, for an appropriate kernel scaling factor, outperforms and often outperforms by a wide margin. This improvement is achieved by a range of scaling factors and is robust across different models and complexities. Regarding model robustness, for a fixed value of E[ε_{n}] the RMS curves are remarkably similar; in particular, the value, k_{D}^{min}, of k_{D} achieving minimum RMS is consistent. In three cases, k_{D}^{min} remains fixed across the models and in the others it changes by not > 0.2. Moreover, in the latter cases, the RMS at the different values of k_{D}^{min} is approximately the same. The overall robustness has important practical implications, because in realworld problems we do not know the data model or its level of difficulty, but we do know the sample size n, the total and selected numbers of features, D and d, and the classification rule. As we will subsequently see, the fact that k_{D}^{min} is robust relative to the data model means that, in practice, we can derive a value of k_{D}, albeit not optimal, that can be used in for a better error estimator.
There is also robustness with respect to the classification rule and number of features. Figure 2a and b show robustness curves for 3NN and CART, respectively, for complexity E[ε_{n}]=0.10 (more on the companion website) for n=50, d=3 and D=200. The curves bear a strong resemblance to the corresponding curves for LDA in Figure 1 and for all models k_{D}^{min}=0.8, as with LDA. We again have LDA in Figure 2c for E[ε_{n}]=0.10 (more on the companion website), but now with D=500. Again there is resemblance to the corresponding case in Figure 1 and again k_{D}^{min}=0.8 for all models.
The preceding observations are mostly constrained to small samples. When n is large, the benefits of using tend to diminish. Figure 3a and b show RMS curves for LDA for n=100 and n=150, respectively, E[ε_{n}]=0.10 (more on the companion website), d=3 and D=200. If the model is known, an optimal is achievable, but robustness diminishes. For n=100, there is still some robustness, but for n=150, even a small deviation from k_{D}^{min} can result a worse performance than . Hence, for n=150, choosing an appropriate is not feasible in practice; however, since our interest is using bolstered error estimation for very small samples, this is not a significant drawback.
2.4 Implementation for real data
For practical application, based on the sample size, the total and selected numbers of features, and the classification rule, we will perform a modelbased analysis like the ones we have performed, thereby resulting in a lookup table of pairs (E[ε_{n}], k_{D}^{min}) as in Figure 1. To illustrate, by averaging across the four models, we obtain the following table for (E[ε_{n}],k_{D}^{min}): (0.05, 0.8), (0.10, 0.8), (0.15, 0.9), (0, 20, 1.0), (0.25, 1.2), (0.30, 1.3), (0, 35, 1.6). Upon designing a classifier from the data, we will obtain the 10fold crossvalidation error estimate, ε_{0}, and then, in the fashion of the method of moments, set E[ε_{n}]=ε_{0} and choose the corresponding value of k_{D}^{min} to serve as the scaling factor for bolstering. Since the lookup table is discrete, E[ε_{n}]=ε_{0} must be solved approximately by interpolation. Corresponding to the seven values of k_{D}^{min} in the lookup table for Figure 1, we have: k_{D}^{min}=0.8 for ε_{0}<0.125, k_{D}^{min}=0.9 for 0.125≤ε_{0}<0.175, k_{D}^{min}=1.0 for 0.175≤ε_{0}<0.225, k_{D}^{min}=1.2 for 0.225≤ε_{0}<0.275, k_{D}^{min}=1.3 for 0.275≤ε_{0}<0.325 and k_{D}^{min}=1.6 for 0.325≤ε_{0}. If one so desires, then a finer selection of expected errors and interpolation can be obtained. One might also use a coarser interpolation for computational purposes, with some loss of performance. In fact, that is precisely what we do here because we will subsequently perform a computationally intensive robustness analysis. Here we use: k_{D}^{min}=0.8 for ε_{0}<0.125; k_{D}^{min}=1 for 0.125≤ε_{0}<0.225; k_{D}^{min}=1.2 for 0.225≤ ε_{0}≤0.275; k_{D}^{min}=1.4 for 0.275≤ ε_{0}≤0.325; and k_{D}^{min}=1.6 for ε_{0}>0.325.
The final bolstered error estimate is computed from the data using this scaling factor. The success of the procedure depends on robustness in choosing a scaling factor because (i) the estimated model will be inaccurate owing to small sample size, (ii) crossvalidation has significant variance for small samples, (iii) the estimated model will differ to some extent from the models involved in creating the lookup table and (iv) the method of moments is not optimal. The following protocol is used to obtain the bolstered resubstitution error estimate:
Protocol 2
Given a sample set S_{n}^{D} with size n and dimension D, select a sized feature set A using a featureselection method F on S_{n}^{D}, resulting in a reduced dimension sample set S_{n}^{d} for the feature set A.
Design a classifier ψ_{n} for S_{n}^{d} according to the given classification rule Ψ_{n}, and compute the 10fold crossvalidation error estimate ε_{0}.
From the lookup table (E[ε_{n}], k_{D}^{min}) choose the kernel scaling factor k_{D}^{min} by setting E[ε_{n}]=ε_{0}.
Compute the bolstered error estimate using the selected scaling factor.
3 RESULTS AND DISCUSSION
To illustrate application, we have applied the method to two gene expression datasets:
Myeloma dataset: data are downloaded from the NIH Gene Expression Omnibus (GEO) under accession numbers GSE5900 and GSE2658, which contain 54 613 probe sets and 559 multiple myeloma (MM) samples, as well as 3 other subtypes [monoclonal gammopathy of undetermined significance (MGUS)], 44 samples; smoldering MM (SMM), 12 samples; healthy donors with normal plasma cell (NPC), 22 samples (Zhan et al., 2006). Samples are labeled into two classes, one for MGUS/SMM/NPC and the other for MM. Due to the significant unbalance of the samples between the two classes, only 156 samples are randomly selected from the 559 MM samples. The number 156 has been chosen as a compromise to take as many samples as possible from MM without significant unbalance between the two classes. Furthermore, only D=200 features with the largest variances across samples are selected from the total 54 613 probe sets. It is advantageous to limit ourselves to the 200 features with the largest variances, because these are more likely to reveal class discrimination and feature selection tends to perform poorly for very large numbers of features when samples are small (Sima and Dougherty, 2006a). Here we must put in a word of caution concerning the methodology. We are using feature variance to produce a set of 200 features to be taken as the full feature set for our performance analysis and will apply feature selection, classifier design and error estimation based on this set, including crossvalidation. In practice, this approach would be unacceptable, because the actual dataset to which we are applying datadependent feature selection is the full 54 613 probe sets. For instance, crossvalidation would have to use the variancebased feature reduction from the full 54 613 on each fold, else it would be optimistically biased. But that is not our goal here. We are a priori assuming that there are only 200 features to which we will apply data analysis. In practice, such a scenario would occur if the reduction to 200 were based on prior biological knowledge.
Breast cancer data set: data are from a microarraybased classification study that analyzes breast tumor samples from 295 patients (van de Vijver et al., 2002). Using a previously established D=70gene prognosis profile (van't Veer et al., 2002), a prognosis signature based on gene expression is proposed in van de Vijver et al. (2002) that correlates well with patient survival data and other clinical measures. Of the 295 microarrays, 115 belong to the ‘goodprognosis’ class and 180 belong to the ‘poorprognosis’ class. Referring to our cautionary comment regarding the multiple myeloma data, we note here that feature selection was used originally to obtain the 70 genes, but, again, from our performance perspective, that is not important for our analysis.
We consider sample size n=50 and d=3 features selected from the D=200 and D=70 features in the myeloma and breast cancer datasets, respectively, and LDA for classification. We repeatedly draw (stratified) n=50point samples with replacement from the empirical distribution (full dataset) as training data with the remaining sample points held out for true error estimation in computing the RMS (ε_{0} is still computed from the training data). The total number of repetitions is 200. The average true error and SD for the myeloma dataset are 0.2170 and 0.0309, respectively. For the breast cancer dataset, the average true error and SD are 0.2340 and 0.0362, respectively. Figure 4 shows the RMS for the two patient datasets. In both cases, performs significantly better. Owing to robustness of the optimal scaling factor, a coarse selection of expected errors and interpolation has proven sufficient.To further demonstrate the effectiveness of Protocol 2, we have applied it to four models in Figure 1: models M1 and M2 with expected errors 0.20 and 0.35. The performance graphs corresponding to Figure 4 are provided in Figure 5. Of particular interest are the scaling factors produced by the protocol. The average scaling factors for the four models are given by: M1, E[ε_{n}]=0.20 − average scaling factor 1.10; M1, E[ε_{n}]=0.35 − average scaling factor 1.39; M2 E[ε_{n}]=0.20 − average scaling factor 1.09; and M2, E[ε_{n}]=0.35 – average scaling factor 1.43. Referring to Figure 1, we see that all these averages are centered within the range of scaling factors where optimal bolstering outperforms .
3.1 Robustness to nonGaussian data
Although k_{D}^{min} is derived with Gaussian models, it is robust enough for models where this assumption is violated, as with the patient data, where the underlying distribution is almost certainly not Gaussian. To further investigate this issue, we take the model M2 in Section 2.3, but perturb the skewness and kurtosis of the class at the origin to obtain a Pearson system (Elderton and Johnson, 1969). Figure 6 shows the eight different distributions in the Pearson system with varying skewness and kurtosis. For the resulting model ^{p} and each skewness and kurtosis combination, where valid, we do the following:

Generate a sample set S_{n}^{D} of size n=50 and a total of D=200 features from the model ^{p}.

Feature select a sized=3 feature set A, resulting in a reduced dimension sample set S_{n}^{d}.

Design a classifier ψ_{n} for S_{n}^{d} using LDA.

Compute the true error ε_{n} using the underlying distribution of the model ^{p}.

Compute the 10fold crossvalidation error .

Compute using k_{D}^{min} from the previous section.

Calculate RMS for and by repeating Steps 1 through 6 a number N=400 of times.
Figure 7 shows the values of RMS for minus the RMS for for different skewness and kurtosis in a heatmap. Due to symmetry, only positive skewness is shown. In all cases, is superior to .
3.2 Concluding remarks
We have derived an optimal kernel scaling factor that can be used for bolstered error estimation in high feature dimensions. This bolstered error estimator achieves a significant RMS improvement over crossvalidation when samples are small, with continued, albeit smaller, performance improvement over the adjusted bootstrap. This superior performance is robust over a wide range of models. Hence, we have been able to incorporate optimality criteria from across a collection of families to arrive at suitable bolstering kernels for practical situations, thereby facilitating its use in applications like classification of genomic data when samples are small.
ACKNOWLEDGEMENTS
We would also like to thank the HighPerformance Biocomputing Center of TGen for providing the clustered computing resources used in this study; this includes the Saguaro2 cluster supercomputer, a collaborative effort between TGen and the ASU Fulton High Performance Computing Initiative.
Funding: National Science Foundation (CCF0634794 and CCF0845407); National Institutes of Health grant (1S10RR02505601) to Saguaro2 cluster, in part.
Conflict of Interest: none declared.
Comments