Kevin K. Dobbin, Richard M. Simon, Sample size planning for developing classifiers using high-dimensional DNA microarray data, Biostatistics, Volume 8, Issue 1, January 2007, Pages 101–117, https://doi.org/10.1093/biostatistics/kxj036
Abstract
Many gene expression studies attempt to develop a predictor of pre-defined diagnostic or prognostic classes. If the classes are similar biologically, then the number of genes that are differentially expressed between the classes is likely to be small compared to the total number of genes measured. This motivates a two-step process for predictor development: a subset of differentially expressed genes is selected for use in the predictor, and then the predictor is constructed from these genes. Both of these steps introduce variability into the resulting classifier, so both must be incorporated in sample size estimation. We introduce a methodology for sample size determination for prediction in the context of high-dimensional data that captures the variability in both steps of predictor development. The methodology is based on a parametric probability model, but permits sample size computations to be carried out in a practical manner without extensive requirements for preliminary data. We find that many prediction problems do not require a large training set of arrays for classifier development.
1 INTRODUCTION/MOTIVATION
The goal of many gene expression studies is the development of a predictor that can be applied to future biological samples to predict phenotype or prognosis from expression levels (Golub and others, 1999). This paper addresses the question of how many samples are required to build a good predictor of class membership based on expression profiles. Determining appropriate sample size is important as available clinical samples are often either very limited or costly to acquire and assay. As microarray studies move from the laboratory toward the clinic, the reason for developing the predictor is increasingly to assist with medical decisions (Paik and others, 2004), and the consequences of having a predictor that performs poorly because the sample size was too small can be serious.
Few methods have been published on sample size determination for developing genomic classifiers. Most publications on sample size determination for microarray studies are limited to the objective of identifying genes that are differentially expressed among the pre-defined classes. These have been reviewed by Dobbin and Simon (2005). Hwang and others (2002) addressed the objective of sample size planning for testing the global null hypothesis that no genes are differentially expressed, which is equivalent to testing the null hypothesis that no classifier performs better than chance. However, a sample size sufficient for rejecting the global null hypothesis may not be sufficient for identifying a good classifier. Mukherjee and others (2003) developed a learning curve estimation method that is applicable to the development of predictors but requires that extensive data already be available so that the learning curve parameters can be estimated. Fu and others (2005) developed a martingale stopping rule for determining when to stop adding cases, but it assumes that the predictor is developed sequentially one case at a time and does not provide an estimate of the number of cases needed.
The high dimensionality of microarray data, combined with the complexity of gene regulation, makes any statistical model for the data potentially controversial. This has led some authors to avoid modeling the expression data directly and instead to model the general abstract learning process (Mukherjee and others, 2003; Fu and others, 2005). But a model for gene expression does not need to be exactly correct to be useful and to provide insights into the classification problem that more abstract approaches do not. We do not assert that the model presented here is exactly correct, but rather that it is a useful oversimplification. Such oversimplifications are not uncommon in sample size determination methodologies: for example, models used to estimate sample sizes in clinical trials are often simpler than the planned data analysis. The simpler model is likely to have lower sensitivity and specificity, resulting in conservative sample size estimates. The simpler model also has the advantage that the resulting calculations will be more transparent, whereas sample size calculations based on a more complex model may be opaque and unconvincing.
A novel contribution of this paper is the integration of dimension reduction into the framework of the normal model to calculate sample size for high-dimensional genomic data. We develop a methodology for calculating the significance level to be used in gene selection that will produce a predictor with the best resulting expected correct classification rate. We present methods for sample size calculation when one class is under-represented in the population. We also present new results on how the size of the fold-change for differentially expressed genes, the noise level, and the number of differentially expressed genes affect predictor development and performance.
Section 2 presents the predictive objective that will be used to drive the sample size calculation. Section 3 presents the probability model for microarray data and the optimal classification rates. In Section 4, sample size algorithms are developed. In Section 5, the accuracy of the approximation formulas used in Section 4 is assessed, as well as their robustness to violations of the model assumptions; the effect of model parameter combinations on sample size requirements and correct classification rates is also examined. In Section 6, the robustness of the methodologies is assessed by application to a number of synthetic and real-world datasets that violate the model assumptions. In Section 7, results and recommendations are summarized.
2 THE SAMPLE SIZE OBJECTIVE
In a traditional class comparison study, the sample size objective is to achieve a certain power, say 95%, to detect a specified difference between the class means when testing at a specified significance level. Under the usual class comparison model assumptions, an adequate sample size will exist because the power goes to 1 as the sample size goes to infinity.
By analogy to the class comparison case, one might wish in the class prediction setting to establish a sample size large enough to ensure that the probability of correct classification (PCC) will be above, say, 95%. There are at least two problems with this objective. The first problem is that the probability of correct classification will depend on what samples are chosen for the training set; in other words, the probability of correct classification will not be a fixed quantity, but will be a random variable with some variance. Hence, for any sample size there will likely be some positive probability that the correct classification rate is below 95%. So we will instead consider the “expected value” of this random variable, where the expectation is taken over all possible sample sets of the same size in the population.
The second problem is that, unlike the power in a class comparison study (which always goes to 1 as the sample size goes to infinity), the PCC in a class prediction study will not necessarily go to 1 as the sample size goes to infinity. This is because for any two populations there may be areas in the predictor space where samples from the two classes overlap, so that class membership cannot be determined with confidence in these areas. An extreme example would be two identically distributed populations, where no predictor can be expected to do better than a coin toss (50%). Lachenbruch (1968) solved this problem by framing the question as: how large a sample size is required for the resulting predictive function to have an error rate within a specified tolerance of the optimum value? For example, a tolerance of 0.10 ensures that the expected probability of correct classification for the predictor will be within 10% of that of the best possible predictor. We will use an objective equivalent to Lachenbruch's, namely: determine the sample size required to ensure that the expected[1] correct classification probability for a predictor developed from training data is within a specified tolerance of the optimal expected correct classification probability for a linear classification problem[2].
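Written out, the objective can be sketched as follows; the tolerance symbol γ is our notation for whatever tolerance the investigator specifies, and PCC(n), PCC(∞) are defined formally in Section 3.1.

```latex
% Sample size objective (sketch; gamma is the investigator-specified tolerance)
\[
  \text{find the smallest } n \ \text{such that} \quad
  \mathrm{PCC}(\infty) - \mathrm{PCC}(n) \;\le\; \gamma ,
\]
% where PCC(n) is the probability of correct classification averaged over all
% training sets of size n drawn from the population, and PCC(infinity) is the
% correct classification probability of the optimal linear classifier.
```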
We would also note that although we will focus attention on this objective, the formulas developed here could also be used to ensure sensitivity and/or specificity above a specified target.
3 THE PROBABILITY MODEL
The general framework for the class prediction problem is that in some population of interest, individuals can be split into k disjoint classes. The classes may correspond to different outcome groups (e.g. relapse in 5 years versus no relapse in 5 years), different phenotypes (e.g. adenocarcinoma versus squamous cell carcinoma), etc. A predictive model will be developed based on a training set T. For each individual in the training set, one observes that individual's class membership and a data vector of gene expression measurements; the goal of the training set experiment is to develop a predictor of class membership based on gene expression, and possibly an estimate of the predictor's performance. Our goal is to determine a sample size for the training set.

The mean vector has a partitioned form in which the first m components correspond to the differentially expressed genes, whose means differ between the classes, and the remaining p − m components are common to both classes.
The model stipulates that within each class, expression vectors follow a multivariate normal distribution with the same covariance matrix. It will also be assumed that differentially expressed genes are independent of non-differentially expressed genes. If the covariance matrix is singular, then some genes are linear combinations of other genes (see, e.g. Rao, 1973, pp 527–8). Put another way, there are “redundant” genes whose expression is completely determined by the expression of other genes. Having these types of redundant genes is analogous to having an overparameterized linear model, and a model reduction transformation (Hocking, 1996, p 81) can eliminate the redundant genes, resulting in a non-singular covariance matrix. We will imagine these redundant genes have been eliminated, so that the covariance matrix is non-singular[4]. Marginal normality of the gene expression vectors may be considered reasonable for properly transformed and normalized data, although multivariate normality may be more questionable. However, in taking a model-based approach one must make some assumption, and one weaker than multivariate normality is unlikely to lead to a tractable solution. The assumption of independence between differentially expressed and non-differentially expressed genes is not critical and is mainly made for mathematical convenience. Violations of this assumption will be evaluated.
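As a concrete sketch of the model just described, under the common-variance parameterization of this section (writing 2δ for the per-gene mean difference, so that 2δ/σ is the standardized effect size used throughout):

```latex
% Two-class model sketch: equal covariance, m differentially expressed genes
\[
  \mathbf{x} \mid \text{class } j \;\sim\; N_p(\boldsymbol{\mu}_j, \Sigma), \qquad j = 1, 2,
\]
\[
  \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2
    \;=\; \big(\underbrace{2\delta, \ldots, 2\delta}_{m},\;
               \underbrace{0, \ldots, 0}_{p-m}\big)^{\prime},
\]
% so each of the m differentially expressed genes has standardized effect size
% 2*delta/sigma, and the remaining p - m genes have identical means in both classes.
```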
A key issue is the size of the mean difference 2δ relative to the biological and technical variation present. A relatively small 2δ would correspond to a differentially expressed gene that is nearly uninformative, i.e. one that will be of little practical use for prediction. We discuss these types of genes with numerous examples below and show that nearly uninformative genes, even if there are many of them (50–200 among thousands of genes), are generally of little use for predictor construction, and that sample size estimates should be based on more informative genes. If the biological reality is that all the differentially expressed genes are nearly uninformative, then we will see that no good predictor will result from the microarray analysis.
It will simplify the presentation to assume that each differentially expressed gene has a common variance and each non-differentially expressed gene has a common variance. In practice, genes are likely to have different variances. But the relationship between fold-change and gene variance determines the statistical power in the gene selection phase of predictor development. In order to keep this relationship intuitive, it is important to have a single variance estimate rather than a range of variance estimates. The single variance parameter can be considered a mean or median variance, in which case the targeted power will be achieved on average, with higher power for some genes and lower for others. More conservatively, the 90th percentile of the gene variances can be used, in which case the targeted power will be achieved even for genes exhibiting the highest variation (which may be the ones of most potential interest) across the population.
3.1 Notation
Throughout, PCC(n) will denote the marginal expected probability of correct classification, taken over all training samples of size n in the population. PCC(∞) will denote the expected PCC for an optimal linear classifier.
In the population of interest, a fixed proportion of individuals belongs to class 1, with the remaining proportion in class 2. The covariance matrix can be written in a partitioned form, with one block the m × m correlation matrix for the differentially expressed genes and the other the (p − m) × (p − m) correlation matrix for the non-differentially expressed genes.
3.2 Optimal classification rates for the model



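For two equal-prior normal populations with a common covariance matrix, the classical linear discriminant result, which reproduces the PCC(∞) values in Table 2, can be sketched as follows (a sketch under the equal-prior, common-variance assumptions, not a formula quoted from the text):

```latex
% Optimal (Bayes) correct classification rate for the linear problem, equal priors
\[
  \mathrm{PCC}(\infty) \;=\; \Phi\!\left(\tfrac{\Delta}{2}\right),
  \qquad
  \Delta^2 \;=\; (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\prime}\, \Sigma^{-1}\,
                 (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2),
\]
% where Phi is the standard normal CDF and Delta is the Mahalanobis distance between
% the class means.  Under gene independence with common variance sigma^2 this becomes
\[
  \Delta \;=\; \sqrt{m}\,\frac{2\delta}{\sigma},
  \qquad
  \mathrm{PCC}(\infty) \;=\; \Phi\!\left(\sqrt{m}\,\frac{\delta}{\sigma}\right).
\]
% Example: m = 50 and 2*delta/sigma = 0.2 give Phi(0.2 * sqrt(50) / 2) = Phi(0.71) = 0.76,
% matching the first row of Table 2.
```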
4 METHODS
Consider linear classifiers of the form: classify a sample in class 1 if a linear combination of its expression values exceeds a threshold, and in class 2 otherwise. The coefficient vector is estimated from the training set and will depend on how genes are selected for inclusion in the predictor and on what prediction method is used. We will take a simple approach to predictor construction that does not weight the importance of the individual genes in the predictor. Each element of the coefficient vector is 0 or 1; 1 indicates a gene determined to be differentially expressed by the hypothesis test of no differential expression versus differential expression; 0 indicates a gene determined not to be differentially expressed. This simple predictor is likely to have lower sensitivity and specificity than more sophisticated ones that assign weights to individual genes, and we would by no means recommend it for practical use. But the sample sizes that we calculate this way should tend to be conservative (large).
Consider the hypothesis tests for gene selection described in the previous paragraph. These are tests of differential expression for each of the p genes. With each hypothesis test is associated a specificity, the probability of correctly identifying a gene that is not differentially expressed, and also a sensitivity or power, which will be denoted β, the probability of correctly identifying a gene as differentially expressed when in fact it is differentially expressed by a specified amount (2δ). These hypothesis tests could be based on many different statistics. The calculations here will use two-sample t-tests.
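The simple 0/1-weighted predictor is not spelled out in full in the text above, so the following Python sketch is one plausible reading, not the authors' exact rule: genes are selected by per-gene two-sample t-tests at level α, and a new sample is assigned to the class whose centroid over the selected genes is closer. All function and variable names are ours.

```python
import numpy as np
from scipy import stats

def build_simple_classifier(X1, X2, alpha=0.001):
    """Select genes by two-sample t-tests at level alpha and return the class
    centroids restricted to the selected genes.  X1, X2 have shape
    (samples_in_class, p_genes).  A sketch, not the authors' exact rule."""
    _, pvals = stats.ttest_ind(X1, X2, axis=0)     # per-gene two-sample t-tests
    selected = np.where(pvals < alpha)[0]          # 0/1 gene weights: keep or drop
    if selected.size == 0:                         # fall back to the single best gene
        selected = np.array([int(np.argmin(pvals))])
    return selected, X1[:, selected].mean(axis=0), X2[:, selected].mean(axis=0)

def classify(x, selected, centroid1, centroid2):
    """Assign x to the class whose centroid is closer on the selected genes
    (an unweighted, diagonal linear rule)."""
    d1 = np.sum((x[selected] - centroid1) ** 2)
    d2 = np.sum((x[selected] - centroid2) ** 2)
    return 1 if d1 <= d2 else 2
```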
4.1 Formulas for PCC(n)


Note that β is the power associated with the gene-specific hypothesis tests that each gene is not differentially expressed among the classes, and the term under the final root sign is the true discovery rate (TDR)[5].
In fact, power calculations can be used to eliminate the power β from the equation, since β is determined by α, n, and the effect size (see Appendix C).
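As a computational companion, the power β of the per-gene two-sided two-sample t-test and the approximate TDR, mβ/(mβ + (p − m)α), can be evaluated directly. The function names and the exact form of the TDR approximation below are ours; the example values are consistent with the equation-based entries in Table 1.

```python
import numpy as np
from scipy import stats

def t_test_power(effect, n_per_class, alpha):
    """Power of a two-sided two-sample t-test for standardized mean difference
    `effect` (= 2*delta/sigma) with n_per_class samples in each class."""
    df = 2 * n_per_class - 2
    ncp = effect * np.sqrt(n_per_class / 2.0)          # noncentrality parameter
    tcrit = stats.t.ppf(1.0 - alpha / 2.0, df)
    return (1.0 - stats.nct.cdf(tcrit, df, ncp)) + stats.nct.cdf(-tcrit, df, ncp)

def true_discovery_rate(m, p, alpha, power):
    """Approximate TDR: expected true discoveries over expected total discoveries."""
    return m * power / (m * power + (p - m) * alpha)

# Example: effect size 4, n = 32 (16 per class), m = 1, p = 10 000, alpha = 0.0001
beta = t_test_power(4.0, 16, 1e-4)
tdr = true_discovery_rate(1, 10_000, 1e-4, beta)       # roughly 0.5 here (power near 1)
```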
4.2 Sample size determination



This gives rise to the following algorithm for sample size determination (a hedged computational sketch follows the list):
1. Initialize the sample size n.
2. Increment n.
3. For the current n, use (C.1) to find the optimal α for each m by a linear search over α, then plug these α's into (4.3) to get an upper bound on PCC(∞) − PCC(n).
4. Is this upper bound within the specified tolerance? If no, return to step 2. If yes, continue to step 5.
5. Use a sample of size n, with n/2 from each class.
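The sketch below strings these steps together. It reuses the t_test_power and true_discovery_rate helpers from the earlier sketch and approximates the expected PCC(n) by Φ((δ/σ)·√(mβ)·√TDR), a form consistent with the structure described in Section 4.1 and with the equation-based entries in Table 1; it is our reconstruction, not the exact formulas (4.3) and (C.1).

```python
import numpy as np
from scipy import stats

def pcc_approx(effect, n, m, p, alpha):
    """Approximate expected PCC(n) for total sample size n split evenly between
    classes (effect = 2*delta/sigma).  Our reconstruction, not equation (4.3)."""
    beta = t_test_power(effect, n // 2, alpha)           # helper from earlier sketch
    tdr = true_discovery_rate(m, p, alpha, beta)
    return stats.norm.cdf(0.5 * effect * np.sqrt(m * beta) * np.sqrt(tdr))

def sample_size(effect, m, p, gamma, alphas=np.logspace(-6, -1, 60), n_max=400):
    """Smallest even n such that PCC(inf) - max_alpha PCC(n) <= gamma."""
    pcc_inf = stats.norm.cdf(0.5 * effect * np.sqrt(m))  # optimal rate (independence)
    for n in range(4, n_max + 1, 2):                     # steps 1-2: initialize, increment
        best = max(pcc_approx(effect, n, m, p, a) for a in alphas)  # step 3: alpha search
        if pcc_inf - best <= gamma:                      # step 4: tolerance check
            return n                                     # step 5: use n, n/2 per class
    return None
```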

4.3 Sample size for a test set
Once a predictor has been developed, an estimate of its performance is required. Such estimates are calculated either using a separate testing set of samples that were not used at all in the predictor development or by cross-validation applied to the training set. Advantages to having an independent test set are discussed in Simon and others (2003).
How best to split a collection of samples into a test set and training set is addressed in Molinaro and others (2005). Alternatively, one can use the methods developed here to determine the sample size required for the training set, and then base the sample size for the test set on the width of the resulting confidence interval for the PCC. This is a valid approach because the independence of the test set implies that a binomial random variable correctly models the number of correct classifications. If L is the desired length of the confidence interval, then the sample size formula[6] for the test set follows directly from the normal approximation to the binomial, using the estimated correct classification rate.
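A minimal sketch of this standard binomial sample size calculation, assuming a 95% confidence interval of total length L (the footnoted normal-approximation conditions should be checked):

```python
import math

def test_set_size(pcc_hat, L, z=1.96):
    """Test set size so that an approximate 95% CI for the correct classification
    rate has total length about L (normal approximation to the binomial)."""
    return math.ceil((2 * z) ** 2 * pcc_hat * (1 - pcc_hat) / L ** 2)

# Example (ours): an estimated PCC of 0.9 and a desired interval length of 0.1
# give test_set_size(0.9, 0.1) == 139.
```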
4.4 Controlling the PCC in each class


If the class proportion is close to 0 or 1, then this approach may lead to very large sample sizes. One option in this case is to oversample from the rarer class, so that, for example, half the samples are from each class. But the resulting estimates of classification error will depend on external estimates of the proportion in each class, so oversampling is most appropriate for preliminary studies.
5 RESULTS
We first ran an extensive simulation to verify that the approximations used in the course of the derivation of the equation for the PCC produce good estimates. Table 1 shows the estimated probabilities of correct classification based on (4.4) and compares these to estimates of the population values based on Monte Carlo simulation for a variety of combinations of effect size 2δ/σ, number of genes affected m, and sample size n. For the Monte Carlo-based estimates, data were generated according to the model specifications, predictors were then developed as outlined in Section 4, and finally prediction accuracy was assessed on an independent test set. As can be seen, the equation-based estimates are close to the Monte Carlo-based population estimates and, when different, tend to be conservative.
Table 1. The expected PCC estimate is generally either accurate or conservative (underestimates the true PCC). Here, the number of genes is p = 10 000, 2δ/σ is the effect size for differentially expressed genes, n is the number of samples in the training set, m is the number of differentially expressed genes, and α is the significance level used for gene selection. Monte Carlo PCC estimates were calculated by generating a predictor based on a simulated sample, then generating 100 datasets from the same populations and calculating the prediction error rate of the predictor; this entire process was repeated 100 times and the average correct classification proportions appear in the table.
| 2δ/σ | n | m | α | PCC(n), equation | PCC(n), Monte Carlo | Difference |
| 4 | 32 | 1 | 0.0001 | 0.92 | 0.96 | – 0.04 |
| 4 | 32 | 1 | 0.001 | 0.72 | 0.78 | – 0.06 |
| 4 | 32 | 1 | 0.01 | 0.58 | 0.59 | – 0.01 |
| 4 | 32 | 1 | 0.05 | 0.54 | 0.59 | – 0.05 |
| 4 | 32 | 1 | 0.1 | 0.53 | 0.54 | – 0.01 |
| 4 | 32 | 10 | 0.0001 | > 0.99 | 1.00 | ≈ 0 |
| 4 | 32 | 10 | 0.001 | > 0.99 | 1.00 | ≈ 0 |
| 4 | 32 | 10 | 0.01 | 0.97 | 0.99 | – 0.02 |
| 4 | 32 | 10 | 0.05 | 0.81 | 0.99 | – 0.18 |
| 4 | 32 | 10 | 0.1 | 0.74 | 0.78 | – 0.04 |
| 3 | 70 | 1 | 0.0001 | 0.86 | 0.92 | – 0.06 |
| 3 | 70 | 1 | 0.001 | 0.67 | 0.81 | – 0.14 |
| 3 | 70 | 1 | 0.01 | 0.56 | 0.59 | – 0.03 |
| 3 | 70 | 1 | 0.05 | 0.53 | 0.52 | + 0.01 |
| 3 | 70 | 1 | 0.10 | 0.52 | 0.52 | – 0.02 |
| 3 | 70 | 10 | 0.0001 | > 0.99 | 1.00 | ≈ 0 |
| 3 | 70 | 10 | 0.001 | > 0.99 | 1.00 | ≈ 0 |
| 3 | 70 | 10 | 0.01 | 0.92 | 0.98 | – 0.06 |
| 3 | 70 | 10 | 0.05 | 0.75 | 0.73 | + 0.02 |
| 3 | 70 | 10 | 0.10 | 0.68 | 0.74 | – 0.06 |
We next consider the choice of α, the significance level for gene selection. α can be chosen to maximize the resulting PCC. Plots of the estimated PCC(n) as a function of α for varying values of n are shown in Figure 1. The plots show that there is an optimal α value in the interior of the interval (0, 1). As the sample size n increases, the α that maximizes the probability of correct classification decreases. This trend is intuitively reasonable because α determines the cutoff value used in the gene selection step, and as the sample size gets large the power associated with small α's will increase. Also note from the plots that the larger the effect size (2δ/σ), the smaller the α value that will maximize PCC(n) for fixed n. This makes sense since larger effect sizes will be easier to detect, so that one can afford to use more stringent α's to reduce the false-positive rate.
Figure 1. Estimated PCC as a function of α, plotted for various values of n, based on (C.1). In each plot, “Effect” is the standardized effect size 2δ/σ, m is the number of differentially expressed genes, and p = 10 000 is the number of genes in the expression vector. The vertical line in each panel marks the α value at which the PCC is maximized.
Figure 2 shows the sample sizes required as a function of the effect size 2δ/σ, for two different numbers m of differentially expressed genes. The sample sizes were optimized[7] over α and ensure that the expected PCC(n) is within the specified tolerance of the best possible prediction rule. For a fixed effect size per gene, the larger number of differentially expressed genes gives an easier classification problem, because the distance between the class means grows with the square root of m; so the sample size requirements are smaller in that case. For an effect size of 1.5 (e.g. a two-fold change in expression on the log2 scale with σ ≈ 0.67), a sample size of around 60 appears adequate.
Figure 2. Effect size (2δ/σ) versus required sample size, with the optimal α used in the sample size determination. The sample sizes ensure that the average PCC is within the specified tolerance of the best possible correct classification probability. Gene independence is assumed, and half the population is from each group.
We next turned to the question of the relationship between the effect size 2δ/σ, the number of differentially expressed genes m, and the Mahalanobis distance between the class means (which determines the best possible classification rate). Table 2 shows examples where the effect size is small, so that each gene is nearly uninformative. For example, when the effect size is 0.2 and the number of informative genes is 200 or less, one of two bad things will happen: either even an optimal predictor will have a poor PCC, so that predictions will be unreliable, or development of a predictor with PCC within some reasonable bound of the optimum will require prohibitively large sample sizes (500 or more). In the latter case, while a good predictor exists, the problem is practically too hard to solve with this technology. The table also shows that as the effect size gets larger, these issues go away. Hence, it is critical that at least some genes have a reasonably large effect size in order to develop a predictor. Therefore, for sample size calculations, one should assume that some genes do have reasonable effect sizes.
Table 2. Impact of small effect sizes on the correct classification rate. PCC(∞) is the best possible correct classification rate; PCC(500) and PCC(200) are the correct classification rates with samples of sizes 500 and 200, respectively. For the small effect size 0.2, a strong classifier exists only if many genes are differentially expressed; but even with many differentially expressed genes and a sample size of 500, the estimator performs poorly. For the somewhat larger effect size 0.4, fewer differentially expressed genes are required for a good classifier to exist, but even n = 200 samples do not result in estimates within 10% of the optimal classification rate. For effect size 0.6, the situation is more amenable if running a large sample is feasible. For reference, if σ = 0.5, then an effect size of 0.6 corresponds to a 2^{2δ} = 2^{0.3} = 1.23 fold-change. Gene independence is assumed here.
| Differentially expressed genes | Effect size 2δ/σ | PCC(∞) | PCC(500) | PCC(200) |
| 50 | 0.2 | 0.76 | 0.60 | 0.54 |
| 100 | 0.2 | 0.84 | 0.64 | 0.56 |
| 150 | 0.2 | 0.89 | 0.67 | 0.57 |
| 200 | 0.2 | 0.92 | 0.70 | 0.59 |
| 10 | 0.4 | 0.74 | 0.71 | 0.62 |
| 20 | 0.4 | 0.81 | 0.79 | 0.68 |
| 30 | 0.4 | 0.86 | 0.84 | 0.72 |
| 40 | 0.4 | 0.89 | 0.88 | 0.75 |
| 10 | 0.6 | 0.83 | 0.83 | 0.79 |
| 20 | 0.6 | 0.91 | 0.91 | 0.88 |
| 30 | 0.6 | 0.95 | 0.95 | 0.93 |
| 40 | 0.6 | 0.97 | 0.97 | 0.95 |
These considerations also lead to the following recommendation: when using conservative sample size approaches with m = 1, the estimated effect size should correspond to the estimated largest effect size for informative genes. This is the approach we used in Section 6, and it seemed to perform adequately. Note also that we assume that p is in the 1000 to 22 000 range and that there are up to 200 or so differentially expressed genes. When either p or m falls outside these ranges, our conclusions may not be valid. For example, microarrays with tens or hundreds of genes represented would require amendment of this approach.
Figure 3 shows plots of the probability of correct classification as a function of n. The lines represent different values of the effect size 2δ/σ. One can see the PCC values approaching their asymptotic values as n approaches 100.
Figure 3. Sample size n versus PCC for various values of the effect size 2δ/σ, with p = 10 000; the two panels correspond to different numbers of differentially expressed genes (m = 5 in the right panel). Gene independence is assumed, the optimal α is used, and the population is assumed evenly split between the classes.
6 EXAMPLES USING SYNTHETIC AND REAL MICROARRAY DATA
We tested the robustness of these methods using synthetic microarray data that violate the model assumptions, and real microarray data that likely do as well. Results are presented in Table 3. See the table caption for a detailed description of the analysis.
Table 3. Robustness evaluation of PCC estimates. Gene selection was based on α = 0.001, with classifiers built using a support vector machine (SVM) and nearest neighbor (1-NN). The sample size is n, with n/2 taken from each class. All equation-based estimates use m = 1. Synthetic datasets: generated from a multivariate normal distribution with three differentially expressed genes having the effect sizes given; p = 1000 genes with a block-diagonal correlation structure, compound symmetric with 20 blocks of 50 genes each and within-block correlation ρ, and each gene with variance 1. The true PCC(n) for each classification algorithm was estimated by using n samples for classifier development and the remainder for PCC(n) estimation, with 10 replicates of a sample size of 200, then averaging the correct classification rates; PCC(∞) was estimated using cross-validation on a sample of size 200. Real datasets: taken from Golub and others (1999) and Rosenwald and others (2002). Predictors were developed using n samples for classifier development and the remainder for PCC(n) estimation; the process was repeated 30 (20) times for the Golub (Rosenwald) datasets, and the average probabilities of correct classification are presented. The effect size 2δ/σ was estimated from the complete data using empirical Bayes methods; see Appendix E for details. PCC(∞) was estimated based on cross-validation on the complete datasets. Analyses were performed using BRB ArrayTools developed by Dr. Richard Simon and Amy Peng Lam.
| Synthetic data | | | PCC(∞) − PCC(n) | | | PCC(n) | | |
| 2δ/σ | ρ | n | Our method | Reality: SVM† | Reality: 1-NN† | Our method | Reality: SVM† | Reality: 1-NN† |
| 3, 2, 1 | 0.8 | 24 | 0.07 | 0.00 | 0.00 | 0.86 | 0.92 | 0.92 |
| 1.5, 1.5, 1.5 | 0.8 | 100 | 0.07 | 0.00 | 0.00 | 0.70 | 0.90 | 0.85 |
| 1.5, 1.5, 1.5 | 0.8 | 24 | 0.16 | 0.13 | 0.12 | 0.61 | 0.77 | 0.73 |
| Real data | | | PCC(∞) − PCC(n) | | | PCC(n) | | |
| Dataset | Classes | n | Our method | Reality: SVM† | Reality: 1-NN† | Our method | Reality: SVM† | Reality: 1-NN† |
| Golub | AML‡/ALL§ | 40 | 0.21 | 0.01 | 0.00 | 0.75 | 0.96 | 0.96 |
| Golub | AML/ALL | 20 | 0.27 | 0.02 | 0.03 | 0.69 | 0.94 | 0.94 |
| Golub | AML/ALL | 10 | 0.45 | 0.09 | 0.10 | 0.51 | 0.87 | 0.87 |
| Rosenwald | 2-year vital | 100 | 0.06 | 0.00 | 0.02 | 0.50 | 0.55 | 0.54 |
| Rosenwald | GCB¶/non-GCB | 200 | 0.07 | 0.00 | 0.03 | 0.67 | 0.96 | 0.91 |
| Rosenwald | GCB/non-GCB | 100 | 0.07 | 0.02 | 0.04 | 0.67 | 0.94 | 0.90 |
| Rosenwald | GCB/non-GCB | 50 | 0.09 | 0.06 | 0.07 | 0.64 | 0.90 | 0.87 |
†Test set-based estimates.
‡AML: acute myeloid leukemia.
§ALL: acute lymphoblastic leukemia.
¶GCB: germinal-center B-cell-like lymphoma.
The estimates of PCC(n) and PCC(∞) based on our method are uniformly conservative across all datasets. The estimates are very conservative when applied to the Golub dataset. This might be expected since there are likely many genes with large effect sizes in these different tumor types, so that our assumption that there is only one informative gene (m = 1) makes the problem appear much harder than it really is. The estimates are extremely conservative not only for the Golub dataset but for several others as well. This is offset somewhat in the PCC(∞) − PCC(n) estimates, since both quantities are biased in the same direction, causing some of the bias to cancel out of the difference.
One must be somewhat careful in the interpretation of Table 3 because, although for the smaller sample sizes the PCC(n) and PCC(∞) − PCC(n) estimates, which are based on means over multiple simulations, appear uniformly conservative, there was also significant variation observed in the performance of classifiers across different simulations. For example, while the mean PCC(n) for the synthetic dataset in the first row of the table was 0.92 using SVMs, the worst classifier had an estimated correct classification rate of 0.79.
We applied both the gene independence-based method and the Ledoit and Wolf eigenvalue estimation method to two other real microarray datasets (not shown) and found that the resulting sample size estimates were similar for the two approaches, and in the range of 40–60 samples.
In conclusion, our method tends to be conservative in estimating PCC(n), PCC(∞), and their difference for the datasets we examined, sometimes very conservative. So the method should lead to adequate sample sizes, while sometimes producing larger sample size estimates than are truly required. For example, our formulas are likely to significantly overestimate the required sample size for classification problems involving morphologically diverse tissues that are expected to have many differentially expressed genes with large effect sizes. In these cases, it may be advisable to follow the guidelines in Mukherjee and others (2003) instead.
7 CONCLUSION
We have presented novel methods for determining sample size for building predictors of class membership from high-dimensional microarray data. These methods take into account variability both in predictor construction and gene selection. These methods require only that two quantities be specified: the size of the fold-change relative to the variation in gene expression and the number of genes on the microarrays. We presented an alternative approach based on eigenvalue estimation. We investigated the robustness of our method on synthetic datasets that violated our model assumptions and on publicly available microarray datasets.
We found that sample sizes in the range of 20–30 per class may be adequate for building a good predictor in many cases. These results are similar to those of Mukherjee and others (2003). In general, we found that the sample size requirements for prediction are relatively small. We showed that the reason for this is that if a good gene-expression-based predictor exists in the population, then it is likely that some genes exhibit significant differential expression between the classes relative to the biological noise present. Hence, adequate power to identify these genes can be achieved without a large number of microarrays. One drawback of our approach is that it controls the expected probability of correct classification to be within some tolerance of the optimum, but does not control the actual probability of correct classification; for small sample sizes, the probability of correct classification may be highly variable depending on what samples are chosen for the training set, so that our method may not give adequate control and should be used with caution in these situations.
We identified scenarios in which either no good predictor exists or it is practically not feasible to construct one. No good predictor may exist when differential expression is small relative to biological variation—and this may be the case even when as many as 100 genes are differentially expressed. We further found that even if enough genes are differentially expressed to construct a reasonable predictor (in theory), if the fold-changes for differentially expressed genes are uniformly small relative to the biological variation, then identification of differentially expressed genes and construction of a good predictor will probably not be feasible.
We investigated both the case in which each class is equally represented in the population and the case in which the classes are unequally represented. We presented methods for controlling the overall PCC and for controlling the PCC in the worst group (sensitivity and specificity) in these situations.
The eigenvalue-based estimation method presented is quite preliminary, and we would in general recommend that the independence assumption approach be used instead. One problem with the eigenvalue approach is that there will be uncertainty about the eigenvalue estimates. Another problem is that these formulas are theoretical worst-case bounds that may in fact be much worse than reality, and therefore lead to overly conservative sample size estimates. Additionally, Table 3 suggests that the gene independence method may be conservative enough already, even when the independence assumption is violated.
A method for finding the optimal significance level to use for gene selection when developing a predictor was presented, and approaches to determining sample size for a testing set were discussed. The optimal significance levels tend to be on a par with those generally used in microarray experiments (e.g. the α = 0.001 used in Section 6). We further showed that the probability of correct classification depends critically on the power and the true discovery rate, so that gene selection methods that control the false discovery rate should produce good predictors.
Conflict of Interest: None declared.
APPENDICES
A.1 Appendix A
Assume model (3.1), where the covariance matrix is positive definite and has the partitioned form given in Section 3.1. For an x randomly selected from the population of interest, the following results hold.




A.2 Appendix B




A.3 Appendix C
In this appendix, we present a method for finding the optimal α for fixed n.


A.4 Appendix D




A.5 Appendix E
The empirical Bayes method used in Table 3 is presented here. The estimated pooled within-class variance is summarized by its median across genes, and the sample variance of the estimated standardized effect sizes is computed; these two quantities determine the empirical Bayes shrinkage of the gene-specific effect sizes. See, e.g. Carlin and Louis (1996). The largest estimated effect size was used.
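As a rough illustration, the following Python sketch implements a generic normal-normal empirical Bayes shrinkage of per-gene standardized effect sizes in the spirit of Carlin and Louis (1996); it is not necessarily the exact estimator used here, and all names are ours.

```python
import numpy as np

def shrunken_effect_sizes(X1, X2):
    """Generic empirical Bayes (normal-normal) shrinkage of per-gene standardized
    effect sizes 2*delta_g/sigma.  A sketch, not necessarily the exact estimator."""
    n1, n2 = X1.shape[0], X2.shape[0]
    pooled_var = ((n1 - 1) * X1.var(axis=0, ddof=1)
                  + (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    sigma = np.sqrt(np.median(pooled_var))              # median pooled within-class SD
    d = (X1.mean(axis=0) - X2.mean(axis=0)) / sigma     # estimated effect sizes
    sampling_var = 1.0 / n1 + 1.0 / n2                  # approximate variance of each d_g
    tau2 = max(d.var(ddof=1) - sampling_var, 0.0)       # between-gene variance of true effects
    shrink = tau2 / (tau2 + sampling_var)               # shrinkage factor toward the mean
    return d.mean() + shrink * (d - d.mean())

# The largest shrunken effect size, shrunken_effect_sizes(X1, X2).max(), would then be
# used as the effect-size input to the sample size calculation, as in Table 3.
```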
NOTES
1. The expectation is taken over all training samples of size n in the population.
2. Under the assumptions of the homogeneous variance multivariate normal model.
3. A differentially expressed gene is defined as a gene with different average expression in the different classes.
4. Note that assuming the population covariance matrix is non-singular is not the same as assuming the estimated covariance matrix S is non-singular. S will usually be singular because of the "large p, small n" issue, i.e. because there are many more genes than samples.
5. Technically, this is the approximate TDR: the expected value of the true discovery proportion (TDP), which is one minus the false discovery proportion (FDP). If TD(n) denotes the number of true discoveries and FD(n) the number of false discoveries at sample size n, then the TDP is TD(n)/(TD(n) + FD(n)).
6. The formula is valid when the expected numbers of correct and incorrect classifications in the test set are both large enough for the normal approximation to hold, which must be verified.
7. For these plots, the optimal α was calculated for all even sample sizes from 2 to 100, and then for each possible sample size the corresponding optimal α was used to estimate the mean PCC(n).