Effect of separate sampling on classification accuracy.

MOTIVATION
Measurements are commonly taken from two phenotypes to build a classifier, where the number of data points from each class is predetermined, not random. In this 'separate sampling' scenario, the data cannot be used to estimate the class prior probabilities. Moreover, predetermined class sizes can severely degrade classifier performance, even for large samples.


RESULTS
We employ simulations using both synthetic and real data to show the detrimental effect of separate sampling on a variety of classification rules. We establish propositions related to the effect on the expected classifier error owing to a sampling ratio different from the population class ratio. From these we derive a sample-based minimax sampling ratio and provide an algorithm for approximating it from the data. We also extend to arbitrary distributions the classical population-based Anderson linear discriminant analysis minimax sampling ratio derived from the discriminant form of the Bayes classifier.


AVAILABILITY
All the codes for synthetic data and real data examples are written in MATLAB. A function called mmratio, whose output is an approximation of the minimax sampling ratio of a given dataset, is also written in MATLAB. All the codes are available at: http://gsp.tamu.edu/Publications/supplementary/shahrokh13b.


INTRODUCTION
The medical community is being confronted with serious problems of reproducibility in the development of biomarkers.The issue has been highlighted by a recent report regarding comments by Janet Woodcock, FDA drug division head.The report states, 'Based on conversations Woodcock has had with genomics researchers, she estimated that as much as 75 percent of published biomarker associations are not replicable.''This poses a huge challenge for industry in biomarker identification and diagnostics development,'' she said (Ray, 2011).'Many issues affect reproducibility, including the measurement platform, specimen handling, data normalization and sample compatibility between the original and subsequent studies.These matters concern experimental procedures and are not our concern here; rather, we are interested in the methodology for designing classifiers.One issue in this regard is the impact of inaccurate error estimation owing to small samples.This has been previously quantified (Yousefi and Dougherty, 2012).Here we are interested in a different problem, one that will confront us even if we have large samples and perfect error estimation: the effect of having predetermined sample sizes so that sampling is not random.
In classification studies it is typically a tacit assumption that sampling is random; indeed, it is commonplace for this assumption to be made throughout a text on classification.For instance, Devroye et al. declare on page 2 of their text that all sampling is random (Devroye, 1996).The assumption is so pervasive that it can be applied without mention.With regard to the problem at hand, Duda et al. (2001) state, 'In typical supervised pattern classification problems, the estimation of the prior probabilities presents no serious difficulties.'But, in fact, there are often serious difficulties.
Under the assumption of random sampling, the data set, S n ¼ {(X 1 , Y 1 ), . . ., (X n , Y n )}, is drawn independently from a fixed distribution of feature-label pairs, (X, Y); in particular, this means that if a sample of size n is drawn for a binary classification problem, then the numbers of sample points, n 0 and n 1 , in classes 0 and 1, respectively, are random variables such that n 0 þ n 1 ¼ n.An immediate consequence of the random-sampling assumption is that the prior probability c ¼ Pr(Y ¼ 0) can be consistently estimated by the sampling ratio, namely, by ĉ ¼ n0 n .Consistency is nothing but Bernoulli's weak law of large numbers, namely, n0 n !c in probability.Thus, if the sample is large, we can expect the sampling ratio to be close to the prior probability.
Suppose the sampling is not random, in the sense that the ratios n0 n and n1 n are chosen prior to sampling.In this 'separate (stratified) sampling' case, S n ¼ S n0 [ S n1 , where the sample points in S n0 and S n1 are selected randomly from Å 0 and Å 1 but, given n, the individual class counts n 0 and n 1 are not random.Then, in effect, we have no sensible estimate of c.One could let ĉ ¼ n0 n , but there would be no reason to do so.
Since our aim is to use the data to train a classifier, does the inability to consistently estimate c matter?Clearly in the case of linear discriminant analysis (LDA) it does, since the LDA classifier is defined by n ðxÞ ¼ 1 if D samp ðxÞ 0 and n ðxÞ ¼ 0 if D samp ðxÞ40, where and l0 and l1 are the sample means of the class-conditional populations Å 0 and Å 1 , respectively, and D is the pooled sample covariance matrix.The rationale for the LDA discriminant is that the estimators converge to the population parameters *To whom correspondence should be addressed.
as n ! 1, in which case the resulting discriminant, D Bayes ðxÞ, defines the Bayes (optimal) classifier in the two-class Gaussian model with common covariance matrix.It is obvious from Equation (1) that an estimate of c is required for LDA and a bad choice of ĉ will negatively impact the classifier.This fact, which is a consequence of separate sampling, has long been recognized (Anderson, 1951).
The situation is less transparent with model-free classification rules such as support vector machines.In this article we use simulation to study the effect of separate sampling on several different classification rules, where the role of c does not appear explicitly in classifier learning.We generate separate samples with different ratios r ¼ n0 n and consider the expected error, E½" n jr, of the designed classifier, given r, where the error of classifier n is defined by " n ¼ Prð n ðXÞ 6 ¼ YÞ, the probability of misclassification.We will see that the penalty for separate sampling without knowledge of c can be severe.
With random (or, 'mixed') sampling, rather than being fixed prior to sampling, r is a sample-dependent random variable.In this case, E½" n jr denotes the expectation of the error conditioned on r and the expected classification error is given by E½" n ¼ E r ½E½" n jr, where the outer expectation is relative to the distribution of r.The classifier error is likely to be smaller when the sampling ratio r is close to c. Hence, if one happens to fix r sufficiently close to c, then E½" n jr5 E½" n .Because r !c in probability as n ! 1 for mixed sampling, as n gets larger the distribution of r gets more tightly concentrated around c, so that the distribution of E½" n jr (as function of r) gets more tightly packed around E½" n , which in turn means that to have E½" n jr5 E½" n one must choose r very close to c.To illustrate this phenomenon, consider 2D Gaussian class-conditional densities with means at (0.3, 0.3) and (0.8, 0.8), possessing common covariance matrix 2 I, where I is the identity matrix and 2 ¼ 0:4, and with c ¼ 0:6: For this model, the Bayes error is " Bayes ¼ 0:27.Figure 1 shows the difference E½" n À E½" n jr for different values of r and different sample sizes when using LDA.If r ¼ 0:6, then E½" n jr5 E½" n for all n.If r ¼ 0:7, which is fairly close to c ¼ 0:6, then E½" n jr5 E½" n for n 25.Notice the lack of symmetry, both 0.5 and 0.7 being equally close to c.This should not be surprising because we should not expect the distribution of E½" n jr to be symmetric.
Let us examine Figure 1 from the practitioner's perspective.Suppose that cost limits the sample to a given size n.If the sample is random, then the expected error of the designed classifier will be E½" n , which is unknown since the feature-label distribution is unknown.Consider three cases: (i) if c is accurately known from existing population statistics regarding the two classes, say BRCA1 and BRCA2 breast cancer, then no matter what the sample size, it is best to do separate sampling with n 0 % cn; (ii) if c is approximately known, meaning that the practitioner believes that c is close to c 0 , then, for small n, it may be best, or at least acceptable, to do separate sampling with n 0 % c 0 n, and the results will likely still be acceptable for large n, though not as good as with random sampling; (iii) if the practitioner has no idea what c is, then sampling must be random because the penalty for separate sampling can be very large.While, at this point, these comments refer specially to Figure 1, which is for LDA, a salient point to be made in this article is that they are quite general and, moreover, can be extended to the commonplace separate sampling situation where one cannot choose n 0 and n 1 .
Why is all of this a major issue for bioinformatics?Simply put separate sampling is ubiquitous in bioinformatics, in particular, with genomic classification, where a standard approach is to take tissue samples from two classes, say, different types of cancer or different stages of cancer, for which the number of specimens in each class is not chosen randomly, and then to design a classifier.The Supplementary Material lists 20 published studies using separate sampling.In each case we give the classification problem, sample sizes, classification rule and error estimator.Even if an error estimate is exact for the problem at hand-that is, for the sampling ratio represented by the data-what does it mean relative to the classification error for future observations (say, patients)?That depends on the true prior probabilities, which we do not know.

Effect of sampling ratio-synthetic data
We employ simulation to study the effect of the sampling ratio for different classification rules using a general model based on multivariate Gaussian distributions with a blocked covariance structure.This model conforms to the setting where blocks represent correlated gene groups, say common pathways, and between-block correlation is negligible (Doughtery et al., 2007;Hua et al., 2005;Shmulevich and Dougherty, 2007).The model has several parameters that can generate a battery of covariance matrices.For example, a 4-block covariance matrix with block size 3 has the structure in which 2 y is the variance of each variable and the i , i 2 f1, 2, 3, 4g, are the correlation coefficients inside blocks.We consider both identical and unequal covariance matrices.We assume common correlation coefficient, i ¼ , i 2 f1, 2, 3, 4g.
A typical microarray or next-gen RNA sequencing (Mortazavi et al., 2008;Wang et al., 2009) experiment yields expressions for thousands of genes, but a small number of sample points, typically 5200.Therefore, data-based feature selection is typically employed; however, since our sole aim is to study the effect of the ratio r on the expected true error, we do not consider feature selection and assume a model containing a reasonable number, D, of features (which is equivalent to assuming that a set of D genes has been chosen by the researcher based on prior biological knowledge).We let D ¼ 15.Two covariance matrix settings are considered: identical covariance matrices, 2 0 ¼ 2 1 ¼ 0:4, and unequal covariance matrices, 2 0 ¼ 0:4, 2 1 ¼ 1:6, with block size l ¼ 5 and correlation coefficient ¼ 0:8 corresponding to tight correlation within a block.The parameter settings are summarized in Table 1.

Effect of sampling ratio-real data
Four microarray real datasets are used: pediatric acute lymphoblastic leukemia (ALL) (Yeoh et al., 2002), acute myeloid leukemia (AML) (Valk et al., 2004), multiple myeloma (Zhan et al., 2006) and breast cancer (Desmedt et al., 2007).We follow the data preparation instructions reported in the cited articles.The properties of these datasets are summarized in Table 2.The right-most column in Table 2 contains the initial feature size, number of sample points in classes 0 and 1, respectively, from left to right.The Supplementary Material provides detailed descriptions of these datasets.The same classification rules as those used for the synthetic data are applied to the real data.T-test feature selection is used to reduce the original set of genes down to D ¼ 15.

Holdout error estimation
Because we are going to use real data, we wish to use holdout error estimation; however, the standard holdout procedure, which is unbiased with random sampling, become biased, perhaps severely so, with separate sampling.Therefore, we redefine holdout for separate sampling.
The true error of a designed classifier n is given by Relative to a random sample, S n , the expected true error is For standard holdout error estimation, the sample is split into t points (the training set) to train the classifier and m points (the test set) to estimate the error, where in this scenario the notation indicates that the total sample size is n ¼ t þ m.Let S t , S m , S m0 , and S m1 denote the set of training data, the full set of test data, the class-0 test points, and the class-1 test points, respectively.The holdout estimator is where "0 and "1 denote the holdout estimators of " 0 n and " 1 n .Taking expectations in ( 6) yields Because the test data are independent from the training data, the holdout estimator is unbiased given the training data, which means that E Sn ½ "jS t ¼ " n : Taking the expectation relative to the training data yields E Sn ½ " ¼ E St ½E Sn ½ "jS t ¼ E Sn ½" n : Similar expressions apply to "0 and "1 , namely, E Sn ½ "0 ¼ E Sn ½" 0 n and E Sn ½ "1 ¼ E Sn ½" 1 n : Thus, ", "0 , and "1 are unbiased estimators of " n , " 0 n , and " 1 n , respectively, and Expression (7) corresponds term by term to (5).
With separate sampling, taking expectations in (6) yields because the ratio m0 m is fixed.Hence, " is not unbiased.The bias depends on the difference between c and m0 m : If c is known, then the holdout estimator can be redefined as for both random and separate sampling.In both cases it is unbiased: taking expectations on the right-hand side of ( 9) yields the right-hand Table 1.Distribution model parameters

Parameters
Value/description  5).(We are unaware of this approach to holdout error estimation in the case of separate sampling having been previously used.)Absent an independent estimate of c, use of ( 6) for separate sampling is unacceptable because it is biased and the extent of the bias is unknown.We use ( 9) for our real-data examples.

Implementation
For a given synthetic model parameter setting, sample size n, ratio r and classification rule É, we approximate the expected true error rate E Sn ½" n ðcÞjr via Monte-Carlo (MC) simulation.Each repetition of the MC simulation is depicted in Figure 2(a).The first set of experiments is done using the flowchart in Figure 2(a).In these experiments, the sample size, n, is fixed but the class sample sizes vary according to the sampling ratio r.Samples S n are generated using the model determined by ðl 0 , D 0 Þ and ðl 1 , D 1 Þ described in Table 1 in accordance with r and n.Assuming a classification rule É, a classifier is trained.The last stage in Figure 2(a) is the true error computation for the designed classifiers, which is also done via MC with 10 000 repetitions.The whole procedure is repeated 5000 times.
The real datasets in Table 2 are sufficiently large to be divided for training and testing and we use holdout error estimation, as previously described for separate sampling.The procedure is graphically illustrated in Figure 2(b).
Fixing the total sample size n, assuming different values for the parameter r, we choose n 0 ¼ rðn À mÞ AE Ç points from class 0 and n 1 ¼ ðn À mÞ À n 0 from class 1, where a d e is the smallest integer greater than or equal to a.The remaining data points are used for holdout estimation.The expected holdout estimate, E½ "n ðcÞjr, is computed via MC approximation.The process is repeated 3000 times.

RESULTS AND DISCUSSION
The full set of results appears in the Supplementary Material.Herein, we provide some results covering a variety of cases.We show results of four classification rules: 3NN, 5NN, L-SVM and RBF-SVM.Results for the synthetic examples with dimension 15 are shown.Also, we only provide the results for n ¼ 100 and n ¼ 80, for the synthetic and real datasets, respectively.For the real datasets, a two-sample t-test is used to reduce the dimensions in Table 2 to D ¼ 15.

Expected true error
The expected true errors for synthetic data with common covariance matrix are given in Figure 3(a)-(d), where each plot gives the expected true error versus the parameter r for different class prior probabilities, i.e. c 2 f0:3, 0:4, 0:5, 0:6, 0:7g: Similar behavior is observed, regardless of classification rule.There is, however, different sensitivsities of different rules to the sampling ratio r.
Results for the second model with different covariance matrices are shown in Figure 3(e)-(h).In this case, there is more varied behavior among the classification rules.Note the point in each figure where the curves cross.This point corresponds to the minimax solution and will be discussed in detail when we analyze the properties of the curves in a later subsection.For equal covariance matrices, the point is close to 0.5 for all the classification rules; however, it can be far from 0.5 for unequal covarinace matrices, depending on the classification rule.

Properties of the error curves
The most obvious characteristic of the error curves is that in each figure they appear to cross at a single value of r.We will now examine this phenomenon.We must be careful because the figures show continuous curves but r is a discrete variable.Hence, we will have to carefully define what it means to 'cross'.
According to the standard definition, a classification rule is said to be 'smart' if the expected value of the error is monotonically decreasing as a function of sample size for all feature-label distributions (Devroye, 1996).This is in accord with the intuition (not always correct) that more data cannot hurt classifier design.We adapt the notion of smartness to the present circumstances by defining a classification rule to be 'class-wise smart' relative to a family of feature-label distributions f c ðx, yÞ ¼ cfðxj0Þþ ð1 À cÞfðxj1Þ, c 2 ½0, 1, if, for all c 2 ½0, 1 and r 2 4r 1 , E½" 0 n jr 2 E½" 0 n jr 1 and E½" 1 n jr 2 !E½" 1 n jr 1 : Intuitively, r 2 4r 1 means that there are more data available from class 0 when designing the classifier when conditioning on r 2 than when conditioning on r 1 , so that one would intuitively expect that the class-0 error when conditioning on r 2 is not greater than when conditioning on r 1 , which is what is stated by the first inequality.The situation reverses relative to the class-1 error and that is what is stated by the second inequality.A classification rule is 'strictly class-wise smart' if r 2 4r 1 implies E½" 0 n jr 2 5E½" 0 n jr 1 and E½" 1 n jr 2 4E½" 1 n jr 1 : In Figure 3 we observe that, if c 2 4c 1 , then for sufficiently small r, E½" n ðc 2 Þjr4E½" n ðc 1 Þjr, where the notation E½" n ðcÞjr indicates that the error is with respect to f c ðx, yÞ: To obtain intuition behind this phenomenon, decompose the error into class-wise errors: For small r, the preponderance of data for classifier design is in class 1 with very little data in class 0. Hence, the class-0 error tends to be greater than the class-1 error, so that In these plots n 0 þ n 1 ¼ 100 is fixed, where n 0 and n 1 are chosen according to the ratio r.The third and the fourth rows show the expected holdout error estimate of the datasets (Yeoh et al., 2002) and (Valk et al., 2004), respectively, where n 0 ¼ 80r d e and n 1 ¼ 80 À n 0 points are randomly selected from classes 0 and 1, respectively, as the training set.The rest of points are held out for error estimation computed via (9) Analogously, for sufficient large r, E½" n ðc 2 Þjr À E½" n ðc 1 Þjr50: ð12Þ We shall assume that whatever classification rule and featurelabel distribution we are considering, ( 11) and ( 12) hold for sufficiently small and sufficiently large r, respectively.The next lemma, whose proof is in the Supplementary Material, states a fundamental property of the error curves.LEMMA 3.2.1If a classification rule is strictly class-wise smart relative to the family ff c ðx, yÞg, then, for c 2 4c 1 , E½" n ðc 2 ÞjrÀ E½" n ðc 1 Þjr is a strictly decreasing function of r.
If we only assume class-wise smart, then E½" n ðc 2 ÞjrÀ E½" n ðc 1 Þjr would only be sure to be a decreasing function of r.
The next lemma, whose proof is in the Supplementary Material, shows that constraining c results in a corresponding constraining of the expected error.LEMMA 3.2.2Suppose c 1 5c 2 5c 3 : If E½" n ðc 3 Þjr !E½" n ðc 1 Þjr, then E½" n ðc 3 Þjr !E½" n ðc 2 Þjr !E½" n ðc 1 Þjr: If E½" n ðc 3 Þjr E½" n ðc 1 Þjr, then E½" n ðc 3 Þjr E½" n ðc 2 Þjr E½" n ðc 1 Þjr: To ease notation, we will say that E½" n ðc 2 Þjr is between E½" n ðc 1 Þjr and E½" n ðc 3 Þjr if either E½" n ðc 3 Þjr !E½" n ðc 2 Þjr !E½" n ðc 1 Þjr or E½" n ðc 3 Þjr E½" n ðc 2 Þjr E½" n ðc 1 Þjr: The salient proposition concerning the error curves involves a strictly decreasing function g(r) of r that is positive for sufficiently small values of r and negative for sufficiently large values of r.Since r is a discrete variable in ð0, 1Þ, we have a sequence of values 05r 1 5 . . .5r nÀ1 51: If g were continuous, then there would exist a unique value r Ã such that gðr Ã Þ ¼ 0, gðrÞ40 for r5r Ã , and gðrÞ50 for r4r Ã .But since g is discrete, this basic proposition is slightly altered.Rather, there are two possibilities: (i) there exists a unique value r Ã ¼ r j for some value j such that gðr Ã Þ ¼ 0, gðrÞ40 for r5r Ã , and gðrÞ50 for r4r Ã or (ii) there is a unique value r j such that gðrÞ40 for r r j and gðrÞ50 for r !r jþ1 : In the second case, we select a point r Ã 2 ðr j , r jþ1 Þ, say, the mid-point, and then we have gðrÞ40 for r5r Ã and gðrÞ50 for r4r Ã , as in the first case.In the next theorem, whose proof is in the Supplementary Material, we will be interested in a 'unique' point r Ã 2 ð0, 1Þ: For the second case, we interpret this to mean that there is a unique interval ðr j , r jþ1 Þ and r Ã is the selected point in that interval.
Given the preceding discrete interpretation, we shall say that a function pðrÞ 'crosses' function q(r) at r Ã if pðrÞ !qðrÞ for r5r Ã and if pðrÞ qðrÞ for r4r Ã : THEOREM 3.2.3If a classification rule is strictly class-wise smart relative to ff c ðx, yÞg, then there exists a unique point r Ã such that for any c 2 4c 1 , E½" n ðc 2 Þjr crosses E½" n ðc 1 Þjr at r Ã : This is precisely the theorem we want because it means that all error curves cross at r Ã : In the error curves of Figure 3, we observe that r Ã provides a minimax value; that is, r mm ¼ r Ã yields the minimum value of E½" n ðcÞjr when taking the maximum error over all values of E½" n ðcÞjr for r 2 ð0, 1Þ : where we must keep in mind that r 2 R ¼ fr 1 , . . ., r nÀ1 g is a discrete variable.The next theorem, proven in the Supplementary Material, formalizes this observation.
THEOREM 3.2.4Consider a classification rule that is strictly class-wise smart relative to ff c ðx, yÞg and let r mm be the minimax value defined by (13).If Theorem 3.2.3yields a unique point r Ã ¼ r j , then r mm ¼ r j ; otherwise, if Theorem 3.2.3yields an interval ðr j , r jþ1 Þ, then either r j or r jþ1 is the minimax ratio, determined by

Practical implications of the error curves
Recall the practical implications we drew regarding Figure 1 in the Section 1: (i) if c is known, then do separate sampling with n 0 ¼ cn; where the equal sign means 'as close to cn as possible'; (ii) if c % c 0 , then for small n do separate sampling with n 0 ¼ c 0 n; (iii) if one has no idea regarding the value of c, then sampling must be random.Looking at the curves in Figure 3 (and similar figures in the Supplementary Material), we see that the curve for c has its minimum value at r ¼ c or r ¼ c 0 % c and, in the latter case, E½" n ðcÞjr % E½" n ðc 0 Þjr: Hence, the first two recommendations hold for the other classification rules examined.
Going beyond the case where c is known or approximately known, consider the third implication, where one has no good idea concerning the value of c.Then the minimax r mm is an option.Its suitability depends upon the classification rule and feature-label distribution.As we can see from Figure 3, except for extreme values of c, E½" n ðcÞjr mm tends not to be too much greater than E½" n ðcÞjc: Of course, there is a practical problem: while we may well know the classification rule, we will not know the feature-label distribution.

Algorithm to approximate r mm
Algorithm 1 provides an iterative algorithm for approximating r mm when the feature-label distribution is unknown.The procedure is an empirical illustration of Theorem 3.2.4,which requires E½" n ðcÞjr, which now needs to be approximated to approximate r mm : Algorithm 1 uses holdout error estimation.The expectation of this error estimate is taken by iterative random sampling from the dataset.Here we give a brief overview of the algorithm.
The inputs to the algorithm are: dataset denoted by S N , classification rule, number of points to be held out for error estimation from classes 0 and 1, denoted, respectively, by n 0 test , and n 1 test and number of iterations, MaxIters, for computing the expected holdout error estimate.The maximum number of points after holding out test sample points is new and N 1 new : The algorithm searches over possible values for r, from 0 to 1, until a stopping criterion is met.Suppose we fix the total sample size n.Then, considering the first extreme case, r ¼ 0, we need to have at least n points in class 1 to draw sample points from, randomly, i.e.N 1 new !n: On the other hand, when r ¼ 1, we similarly should have N 0 new !n: Hence, we should have n minfN 0 new , N 1 new g, whereby we set n ¼ minfN 0 new , N 1 new g: The algorithm's search criterion is based on Theorem 3.2.4: in a 'while loop' over an increasing sequence of the ratios r, the algorithm computes the estimated slope of the expected error (as a function of c), this being slope new ¼ E½ "0 n jr À E½ "1 n jr (line 22 of the algorithm), obtained by plugging the error estimates (lines 7-21 of the algorithm) into the unknown slope formula E½" 0 n jr À E½" 1 n jr: Because the classification rule is strictly classwise smart, for sufficiently small r, the slope is positive, and it becomes negative for sufficiently large r (refer to Supplementary Material file for further explanation).Once a point is reached at which the sign of the slope becomes non-positive, the 'while loop' stops increasing r.Thereafter, the three different possibilities given by Theorem 3.2.4 are checked, in lines 26-32, and finally a single r mm is returned.Although the returned minimax ratio is only computed for sample size n defined above, the class-sizes can still be conservatively adjusted per r mm in the dataset S N because, for a ratio given by the algorithm, if one increases the sample size, then in the worst case the error is as large as the minimax value returned by the algorithm.
Algorithm 1 Iterative algorithm to approximate r mm (an implementation of Theorem 3.2.4 using an estimate of the expected error estimate)

Adjusting sample sizes
Consider the common situation in which n 0 and n 1 have been determined beforehand, but suppose one knows c.The curves of Figure 3 still apply but we are not free to choose n 0 and n 1 , so that we cannot choose n 0 ¼ cn.Nevertheless, we desire the training data to be apportioned according to c and we want to use as much data as is possible.These conditions mean that for training we want class sample sizes m 0 and m 1 such that m ¼ m 0 þ m 1 is maximized given the constraints m 0 ¼ cm, m 1 ¼ ð1 À cÞm, m 0 n 0 , and To see the effect of adjusting sample sizes, we consider the difference, Áðr, cÞ ¼ E½" n ðcÞjr À E½" m ðcÞjc, between the expected true errors of two cases, n being the original sample size and m the adjusted sample size.When the sampling ratio is r and the true prior probability is c, Áðr, cÞ can be interpreted as the penalty incurred.Figure 4 shows Áðr, cÞ for L-SVM and RBF-SVM for the equal covariance model described in Table 1.The result for the case with unequal covariance matrices can be found in the Supplementary Material.The two parameters r and c take values from 0.06 to 0.94 with the step size of 0.04.As expected, as jr À cj increases, Áðr, cÞ significantly increases.When r % c, Áðr, cÞ % 0, which is always the minimum.The figure shows that except when r is very close to c, Áðr, cÞ40, meaning that, even though m5n, a correct sampling ratio more than compensates for the loss of data due to subsampling.

Population-based minimax theory
The minimax value in the error curves of Figure 3 depends on the sampling distribution and results from the fact that E½" n ðcÞjr is minimized over c for a single value r Ã : In Anderson (1951), a population-based minimax approach was taken to arrive at a 'best' choice for ĉ in (1) in the Gaussian model with common covariance matrix under separate sampling.Here we extend the population-based mimimax approach to arrive at much more general solution than that given by Anderson.It is based upon the fact that the Bayes classifier can be determined via a discriminant involving the class-conditional densities.Anderson also utilized the Bayes classifier in his analysis but he restricted it to the Gaussian model with common covariance matrix, in which case the Bayes classifier is given by LDA using the actual parameters rather than their estimates as in (1).
Given the class-conditional distributions and prior probabilities, the Bayes classifier, Bayes , is determined by the discriminant where Bayes ðxÞ ¼ 1 if D Bayes ðxÞ 0 and Bayes ðxÞ ¼ 0 if D Bayes ðxÞ40: The regions assigned to the two classes are R 1 ¼ fx : D Bayes ðxÞ 0g and R 0 ¼ fx : D Bayes ðxÞ40g: If c is unknown and replaced by ĉ, then the discriminant becomes which defines the classifier ĉ, with corresponding class regions R 1 ð ĉÞ ¼ fx : D prior ðxÞ 0g and R 0 ð ĉÞ ¼ fx : D prior ðxÞ40g.The error associated with ĉ is where for y 2 f0, 1g: R 1 ð ĉÞ and R 0 ð ĉÞ are strictly increasing and decreasing, respectively, for increasing values of log 1À ĉ ĉ : Hence, if the conditional densities are strictly positive, then " 1 ð ĉÞ and " 0 ð ĉÞ are strictly decreasing and increasing, respectively, for increasing values of log 1À ĉ ĉ : The minimax choice selects the value of ĉ that yields the minimum value of the error "ð ĉ, cÞ when taking the maximum error over all values of c 2 ð0, 1 : We state a lemma and a theorem, whose proofs are given in the Supplementary Material that can be used to find minimax solutions.
LEMMA 3.6.1If the class-conditional distributions are strictly positive and " y ð ĉÞ, y ¼ 0, 1, is a continuous function of ĉ, then there exists a unique point ĉmm such that " 0 ð ĉmm Þ ¼ " 1 ð ĉmm Þ and this point corresponds to the minimax solution defined in (19).THEOREM 3.6.2If the class-conditional distributions are strictly positive and " y ð ĉÞ, y ¼ 0, 1, is a continuous function of ĉ, then the minimax solution for the discriminant in (16) is the value of c that gives rise to the maximum Bayes error for the discrimination problem of (15).
To apply the lemma to the Gaussian model with common covariance matrix, note that the discriminant takes the form and the error of the classifier induced by D prior is given by where is the standard normal cumulative distribution function.
To illustrate the effect of the underlying feature-label distribution on ĉmm , we consider a situation similar to that used for Figure 1, except that we allow unequal covariance matrices.Figure 5 contains Bayes-error curves as functions of c for different covariance models.It shows that, except for a common covariance matrix, ĉmm 6 ¼ 0:5, ĉmm being the value of c at which the curve attains its maximum.The curve for equal covariance matrices is constructed analytically; for other values, MC simulation is employed.
Obviously, if ĉmm ¼ c, then the minimax value will perform well.But what happens when ĉmm 6 ¼ c?In fact, the minimax value can work well so long at it is close to the true value, how close depending on the particulars of the problem.For a Gaussian model with common covariance matrix, as used for Figure 1, we consider LDA with random sampling under three scenarios: (i) known c, (ii) minimax ĉmm and (iii) the maximumlikelihood estimate ĉml ¼ n0 n : Figure 6 shows the expected errors (MC estimates) as a function of c for n ¼ 20 and 80.In all cases, known c is the best.When the sample is small, ĉmm outperforms  ĉml for a fairly wide range of c, but this advantage disappears rapidly as n grows.The reason for this behavior is the difficulty of estimating c by ĉml for small samples.All curves show that when c is large or small the minimax solution gives poor results.Let us close this section by noting that the Bayes classifier is intrinsic to the feature-label distribution and, since the minimax choice depends only on the form of the Bayes classifier, it is independent of any particular classification rule.

Concluding remarks
We have shown, via simulations on both synthetic and real examples, that separate sampling with an inappropriate sampling ratio can significantly degrade classification accuracy for classification rules that do not use an explicit estimate of the prior probability.We have demonstrated some fundamental properties of the expected-error curves, developed the minimax samplebased theory for those curves, proposed an algorithm to approximate the minimax value in practice and extended the classical Anderson minimax theory for prior probabilities.We have provided heuristics on how to proceed when the prior probability is known (or known within a small range) and we have proposed a subsampling methodology to implement these heuristics when the class sample sizes are predetermined.Given the ubiquity of separate sampling in biomedicine, it would behoove the medical community to record incidence rates of patient sub-types (population statistics), so that very accurate estimates of class prior probabilities would be available.While this would certainly incur some cost, that cost would be minuscule compared to the costs incurred by the irreproducibility of classification studies.
Fig. 6.Expected true error for different scenarios with random sampling (fixed n 0 þ n 1 with random n 0 and n 1 ) for the same model as used in Fig. 1

Fig. 1 .
Fig.1.The difference E½" n ðcÞ À E½" n ðcÞjr for different values of r as a function of sample size, n, with c ¼ 0.6.The class conditional densities parameters are as follows: l 0 ¼ ð0:3, 0:3Þ, l 1 ¼ ð0:8, 0:8Þ, and D 0 ¼ D 1 ¼ 0:4I, leading to the Bayes error " Bayes ¼ 0:27 Figure 3(d)-(p) show results for two real datasets, where the performance is assessed via holdout error estimation.Each plot includes the expected holdout error estimate versus r for five values of c.In contrast to Figure 3(a)-(h), the curves in Figure 3(d)-(p) are not smooth, which is a result of discrete error estimation.Nonetheless, there still is a crossing point.The solid vertical lines in Figure 3(i)-(p) are fixed on the initial sampling ratios.

Fig. 2 .
Fig. 2. Flowcharts of the processes implemented for the synthetic and real dataset examples.The dashed boxes show one iteration of the MC simulation, repeated to find an approximation for the expected true error, i.e.E Sn ½" n ðcÞjr, and the expected holdout error estimation, i.e.E½ "n ðcÞjr, respectively, for the synthetic and real dataset examples

Fig.
Fig.The first and the second rows show the expected true error rates of four classification rules when covariance matrices are identical and unequal, respectively.In these plots n 0 þ n 1 ¼ 100 is fixed, where n 0 and n 1 are chosen according to the ratio r.The third and the fourth rows show the expected holdout error estimate of the datasets(Yeoh et al., 2002) and(Valk et al., 2004), respectively, where n 0 ¼ 80r d e and n 1 ¼ 80 À n 0 points are randomly selected from classes 0 and 1, respectively, as the training set.The rest of points are held out for error estimation computed via (9)

Fig. 4 .
Fig. 4. The parameter Áðr, cÞ for classification rules, L-SVM and RBF-SVM trained on the sample data with size n ¼ 100 from the common covariance matrix model described in Table 2

Table 2 .
Real datasets used in this article