Abstract

The accuracy of a single diagnostic test for a binary outcome can be summarized by the area under the receiver operating characteristic (ROC) curve. Volume under the surface and hypervolume under the manifold have been proposed as extensions for multiple-class diagnosis (Scurfield, 1996, 1998). However, the lack of simple inferential procedures for such measures has limited their practical utility. Part of the difficulty is that calculating such quantities may not be straightforward, even with a single test. The decision rules used to generate the ROC surface require class probability assessments, which are not provided by the tests. We develop a method based on estimating the probabilities via some procedure, for example, multinomial logistic regression. Bootstrap inferences are proposed to account for variability in estimating the probabilities and perform well in simulations. The ROC measures are compared to the correct classification rate, which depends heavily on class prevalences. An example of tumor classification with microarray data demonstrates that this property may lead to substantially different analyses. The ROC-based analysis yields notable decreases in model complexity relative to previous analyses.

1. INTRODUCTION

It is crucial for cancer diagnosis and treatment to accurately identify the true status of a potentially malignant lesion. With the rapid emergence of DNA microarray technologies, constructing gene expression profiles for different cancer types is a promising approach to cancer classification. An important practical issue is the determination of a parsimonious set of genes which can optimally classify tumors. The problem is particularly challenging when there are more than 2 tumor types, where it may be unclear how to properly combine classification performance across the multiple categories. In this paper, we propose new methodology based on receiver operating characteristic curve analysis for multiple classes and multiple tests. While our focus is the microarray application, the procedures are broadly applicable in the classification setting.

The ROC curve has proven remarkably versatile in medical decision making in binary classification settings (Pepe, 2003). The area under the ROC curve (AUC) has been widely used in clinical practice to summarize the accuracy of diagnostic tests. It is conceptually appealing and can be visually assessed from a ROC plot. With a single test, it may be estimated nonparametrically using the Mann–Whitney statistic and inferences may be based on well-developed theory for U-statistics (Hoeffding, 1948).

In practice, diseases often involve more than 2 categories and diagnostic procedures must assign individuals to one of the outcome types. An example investigated in this manuscript is tumor identification with gene expression data, as in Khan and others (2001), where cDNA microarrays were employed to classify small round blue-cell tumors (SRBCTs) into 4 pathologies. One may evaluate all pairs of classes using 2-class ROC analyses (Obuchowski and others, 2001), but this does not provide an assessment of overall accuracy. This paper considers generalizations of the AUC to the multicategory problem, which automatically combine information across pairs of categories. The proposed hypervolume under the manifold (HUM) analysis will be compared with the pairwise AUC method in SRBCT classification in Section 3.

Work on ROC extensions to >2 classes first appeared in the mathematical psychology literature. Scurfield (1996) presented a rigorous extension of 2-class ROC to a finite number (≥3) of classes and generalized the 2-class AUC via an information theoretic derivation. In a subsequent paper, Scurfield (1998) explored the manner in which a multidimensional predictor variable may be used to construct class scores, which may then be incorporated in the decision rules used to generate the ROC manifold. This foundational work focused primarily on the theoretical development of a higher-dimensional ROC framework at the population level. No inferential procedure was proposed for use with randomly sampled data, and empirical results were not presented.

Mossman (1999) independently developed high-dimensional ROC concepts. For the 3-class problem, Mossman (1999) suggested estimating the volume under the surface (VUS) using probability ratings for the 3 classes, which are assumed to be provided by a reader for the sampled data. Suppose the disease statuses are A, B, and C. The probability triplet is $(E_A, E_B, E_C)$, where $E_A, E_B, E_C > 0$ and $E_A + E_B + E_C = 1$. Each component indicates the likelihood of being in the corresponding category. In practice, when there are multiple tests, these probabilities may not be directly available, as medical screening tests typically involve numeric data which cannot be easily translated into probability scores.

In Section 2.1, we argue that any method providing estimated class probabilities may be used for statistical inference about HUM with ≥3 classes. The tests may serve as inputs to linear discriminant analysis (Fisher, 1936), nearest-neighbor rules (Fix and Hodges, 1951), classification and regression trees (Breiman and others, 1984), and recent machine learning approaches, with outputs being probability estimates. Dudoit and others (2002) overview the performance of various methods for tumor classification with microarrays. For simplicity, in the empirical studies in Section 3, we adopt multinomial logistic regression, which is widely used in practice and has been employed in 2-class ROC analysis (Pepe, 2003), where it possesses certain optimality properties (McIntosh and Pepe, 2002). However, the proposed methodology is generally applicable with procedures yielding estimated probabilities.

Extending nonparametric inference for the AUC, Dreiseitl and others (2000) derived variance estimators for the VUS estimator from Mossman (1999) using U-statistic theory. The theory generalizes to ≥4 classes, but as the number of categories increases, the variance formulae become complicated (Hoeffding, 1948) and are computationally burdensome. Nakas and Yiannoutsos (2004) proposed inferences for HUM with ≥3 classes assuming that the classes are ordered based on a single test score. These methods do not account for variability in estimating class probabilities with multiple tests using multinomial logistic regression (or other procedures) when such probabilities are not directly provided by readers. In Section 2.2, we outline a bootstrap method which is valid for any procedure providing probability assessments, regardless of the number of input tests. Simulation studies, available in a supplementary file on Biostatistics online, demonstrate that the proposed inferences work well with realistic sample sizes.

An alternative measure of diagnostic accuracy is the correct classification rate (CCR). Each subject is assigned to the category with the highest probability, and the proportion assigned correctly is the CCR. As noted by Scurfield (1996) and others, the CCR weights accuracy across categories by the category prevalences and may be highly dependent on these prevalences, and hence is not comparable across populations. This contrasts with HUM, which is prevalence independent, as discussed in Section 2.3.

An issue in tumor classification is screening expression levels when building predictive models. In Section 3, we present an example of microarray analysis. In the Khan and others (2001) example, the prevalence dependence leads to vastly different single-gene rankings based on HUM and CCR, with the 2 measures being virtually uncorrelated. Employing HUM, we develop a model with 4 expression levels which achieves almost perfect HUM and CCR for diagnosing the 4 tumor types. This parsimony contrasts with Khan and others (2001), where ≈ 100 expression levels gave similar performance.

2. ROC METHODS

2.1 Definition of HUM

We first consider a simple hypothetical example. Suppose we want to classify 2 classes of tumors using 1 gene expression. Half of the subjects are from class 1 and the other half from class 2. Using the gene expression yields a 2-dimensional probability assessment vector $p^{(i)} = (E_{i1}, E_{i2})$, with $E_{i1} > 0$, $E_{i2} > 0$, and $E_{i1} + E_{i2} = 1$, where $E_{ij}$ is the assessed probability that the ith subject is in category $j = 1, 2$. Classification follows the decision rule: if $E_{i1} > c$, then classify the ith subject to class 1; otherwise classify the ith subject to class 2. When $c = 0$, all subjects are classified in class 1, and the probability of correctly classifying class 1, denoted by $t_1$, is 1, while the probability of correctly classifying class 2, denoted by $t_2$, is 0. One may vary c and obtain the correct classification probabilities $(t_1, t_2)$ for the 2 classes accordingly. Connecting the resulting $(t_1, t_2)$ points gives the usual 2-dimensional ROC curve.
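As a concrete illustration, the following minimal sketch (ours, in Python, with hypothetical array names) traces the empirical $(t_1, t_2)$ pairs by sweeping the threshold c over a grid:

```python
import numpy as np

def roc_points(e1_class1, e1_class2, grid=101):
    """Correct-classification pairs (t1, t2) as the threshold c varies.
    e1_class1: assessed Pr(class 1), E_{i1}, for subjects truly in class 1;
    e1_class2: the same assessments for subjects truly in class 2."""
    pts = []
    for c in np.linspace(0.0, 1.0, grid):
        t1 = np.mean(e1_class1 > c)    # correctly retain class 1 above c
        t2 = np.mean(e1_class2 <= c)   # correctly send class 2 below c
        pts.append((t1, t2))
    return pts
```

At c = 0 the sketch returns (1, 0), matching the endpoint described above, since all assessments exceed 0.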

Without loss of generality, $t_1$ and $t_2$ may be described by a functional relationship, $t_2 = f_1(t_1)$, and the AUC may be expressed as $\mathrm{AUC} = \int_0^1 f_1(t_1)\, dt_1$. The above construction of the 2-dimensional ROC curve can be extended to higher dimensions. The key is to properly generalize the decision rules based on the probability assessment vectors. With M classes, a decision rule uses $(E_1, \ldots, E_M)$, where $E_1, E_2, \ldots, E_M > 0$ and $E_1 + E_2 + \cdots + E_M = 1$, to place an item of unknown class into one of the M categories. Each component indicates the likelihood of being in the corresponding category. Working with M = 3, Mossman (1999) described 3 types of rules, denoted by RI, RII, and RIII. Next, we discuss generalizations of the 3 rules.

The original RI finds cutoffs dividing the 3-dimensional space into 3 separate regions and groups subjects falling in the same region. More generally, the rule is: if $E_1 > c_1$, then classify the item as class 1; otherwise, if $E_2 > c_2$, then classify the item as class 2; ...; otherwise, if $E_{M-1} > c_{M-1}$, then classify the item as class M − 1; otherwise, classify the item as class M. With M classes, this requires specifying M − 1 thresholds, and different specifications produce different decision rules. At a fixed set of M − 1 thresholds, the probability of correctly classifying the ith class is denoted by $t_i$, $i = 1, \ldots, M$. One plots the points $(t_1, \ldots, t_M)$ in M-space for all decision rules and then connects these points to form a piecewise linear manifold. The hypervolume under this manifold is a summary measure of the rule's accuracy. Figure 1 presents a hypothetical example with M = 3, displaying a 3-dimensional view of the polyhedral ROC surface, where each vertex on the surface corresponds to a vector $(t_1, t_2, t_3)$ for some decision rule. The volume under this 3-dimensional surface yields an estimate of the diagnostic measure proposed in Scurfield (1996).
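The generalized rule RI is easy to operationalize. The sketch below (our illustration, with classes indexed 0, ..., M − 1) enumerates a grid of thresholds for M = 3 and collects the empirical $(t_1, t_2, t_3)$ points that trace a surface like the one in Figure 1:

```python
import numpy as np

def rule_RI(p, c):
    """Sequential thresholding rule R_I; p = (E_1, ..., E_M), c = (c_1, ..., c_{M-1}).
    Classes are indexed 0, ..., M-1 here."""
    for m, cm in enumerate(c):
        if p[m] > cm:
            return m
    return len(c)   # class M when no earlier threshold is exceeded

def surface_points(probs, labels, grid=21):
    """probs: (n, 3) probability assessments; labels: true classes in {0, 1, 2}.
    Returns the empirical (t1, t2, t3) points traced by a grid of (c1, c2)."""
    pts = []
    for c1 in np.linspace(0, 1, grid):
        for c2 in np.linspace(0, 1, grid):
            pred = np.array([rule_RI(p, (c1, c2)) for p in probs])
            pts.append(tuple(np.mean(pred[labels == k] == k) for k in range(3)))
    return pts
```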

Fig. 1. A hypothetical example of 3-class ROC analysis.

We next discuss the generalization of RII. Suppose there are M subjects, one from each of the M classes, with the ith subject having probability assessment $p^{(i)} = (E_{i1}, \ldots, E_{iM})$, $i = 1, 2, \ldots, M$. Let $v_i$, $i = 1, 2, \ldots, M$, be the M-dimensional vector whose elements are all 0 except that the ith element equals 1. RII assigns item i to class m if $p^{(i)}$ is closer to $v_m$ than any other $p^{(j)}$, $j \neq i$, is. When M = 3, this rule requires a decision maker to perform a 3-alternative forced-choice task (Scurfield, 1996).

The decision rule RIII is similar to RII but more useful, in that it suggests a simple method of statistical inference. Suppose there are M subjects, one from each of the M classes, with probability ratings $p^{(1)}, p^{(2)}, \ldots, p^{(M)}$, respectively. Let $v_1 = (1, 0, \ldots, 0)$, $v_2 = (0, 1, \ldots, 0)$, ..., $v_M = (0, 0, \ldots, 1)$ and assign the subjects to classes $k_1, k_2, \ldots, k_M$ such that $\sum_{i=1}^{M} \| p^{(i)} - v_{k_i} \|$ is minimized among all possible assignments with $k_1 \neq \cdots \neq k_M$, where $\| \cdot \|$ is Euclidean distance. Now, let $\mathrm{CR}(p^{(1)}, p^{(2)}, \ldots, p^{(M)})$ be 1 if all M subjects are classified correctly, and 0 otherwise. The probability $\Pr\{\mathrm{CR}(p^{(1)}, p^{(2)}, \ldots, p^{(M)}) = 1\}$ is the HUM.

The decision rules RI, RII, and RIII yield identical HUM and hence should give interchangeable inferences about HUM based on the same probability assessment vectors. We shall employ RIII as the decision rule for computing HUM, as described below.
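A minimal sketch of the RIII computation follows (in Python; ties in the distance minimization are broken arbitrarily, a case occurring with probability 0 for continuous assessments):

```python
import numpy as np
from itertools import permutations

def CR(P):
    """Rule R_III indicator. P is an M x M array whose ith row is the
    probability vector p^(i) of the item truly from class i. Returns 1 if the
    distance-minimizing assignment classifies all M items correctly."""
    M = P.shape[0]
    V = np.eye(M)   # vertices v_1, ..., v_M of the probability simplex
    best = min(permutations(range(M)),
               key=lambda k: sum(np.linalg.norm(P[i] - V[k[i]]) for i in range(M)))
    return int(best == tuple(range(M)))   # identity assignment = all correct
```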

We now give a rigorous definition of HUM, as first presented in Scurfield (1996). Let $t_i$ denote the probability that a subject from class i is correctly classified, and write the recursive relationships $t_i = f_{i-1}(t_1, \ldots, t_{i-1})$, $i = 2, \ldots, M$. The HUM is defined by
$$\mathrm{HUM} = \int_0^1 \int_0^1 \cdots \int_0^1 f_{M-1}(t_1, t_2, \ldots, t_{M-1})\, dt_{M-1} \cdots dt_2\, dt_1,$$
with the convention that $f_{M-1}$ is 0 wherever the manifold is not defined.
With a single continuous test having a known distribution and ordered categories, the above integral simplifies to the expression in Nakas and Yiannoutsos (2004), which has a manageable form. Without knowledge of the test's distribution, manipulating the integral is cumbersome. The recursive relationships are hard to estimate nonparametrically, and numerical integration may be required, even with parametric assumptions.
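For orientation, the definition specializes to the familiar cases: with M = 2 it is the AUC of Section 1, and with M = 3 it is the VUS,
$$\mathrm{AUC} = \int_0^1 f_1(t_1)\, dt_1, \qquad \mathrm{VUS} = \int_0^1 \int_0^1 f_2(t_1, t_2)\, dt_2\, dt_1.$$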

If the tests are independent of disease, then the probability of a correct rating is $(M!)^{-1}$. This occurs because each assignment vector is equally likely, and the correct vector is randomly selected from the set of M! possible vectors. For M = 2, the noninformative value for AUC is 0.5, while for M = 3, this value is 0.17. In the SRBCT example (Khan and others, 2001), with M = 4, it equals 0.04.

2.2 Inference for HUM

Generalizing Dreiseitl and others (2000), estimating HUM with M ≥ 3 can be accomplished using the following approach. Consider each subset of M persons containing exactly 1 person from each class. Recall that the HUM estimate is the proportion of such subsets in which each of the M persons is correctly classified. Correct classification of the whole subset occurs with probability $(M!)^{-1}$ by chance alone. Let $p_{ij}$, $j = 1, \ldots, n_i$, be the probability assessment vectors for the $n_i$ subjects from class $i = 1, \ldots, M$. The HUM estimator is
$$\widehat{\mathrm{HUM}} = \left(\prod_{i=1}^{M} n_i\right)^{-1} \sum_{j_1=1}^{n_1} \cdots \sum_{j_M=1}^{n_M} \mathrm{CR}\left(p_{1j_1}, p_{2j_2}, \ldots, p_{Mj_M}\right).$$
Following U-statistic theory (Hoeffding, 1948), this quantity is asymptotically normal and its variance can be estimated consistently; see Dreiseitl and others (2000) for detailed formulae for VUS with M = 3. For M > 3, the variance estimator is complicated, involving summations of order $(\prod_i n_i)^2$. When the $p_{ij}$ are unknown, these formulae are not applicable.
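In code, the estimator is a direct translation of this proportion (a sketch reusing the CR function above; the sum has $\prod_i n_i$ terms, so with large samples one may need to vectorize or subsample the cross-class subsets):

```python
import numpy as np
from itertools import product

def hum_estimate(prob_by_class):
    """prob_by_class: list of M arrays; the ith has shape (n_i, M) and holds
    the probability vectors p_{ij} for the n_i subjects of class i."""
    M = len(prob_by_class)
    total = correct = 0
    for rows in product(*[range(p.shape[0]) for p in prob_by_class]):
        P = np.vstack([prob_by_class[i][rows[i]] for i in range(M)])
        correct += CR(P)          # CR as sketched in Section 2.1
        total += 1
    return correct / total
```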
Next, we examine the case where the $p_{ij}$ are unobservable and must be estimated from diagnostic tests. Each subject has q test scores, with the vector of scores denoted by $T_{ij} = (T_{ij1}, \ldots, T_{ijq})^{\mathrm{T}}$, $i = 1, \ldots, M$, $j = 1, \ldots, n_i$. One may model the class probabilities with multinomial logistic regression, where for $i = 1, \ldots, M$, $j = 1, \ldots, n_i$, and $c = 1, \ldots, M - 1$,
$$p_{ijc} = \frac{\exp\left(\beta_c^{\mathrm{T}} X_{ij}\right)}{1 + \sum_{m=1}^{M-1} \exp\left(\beta_m^{\mathrm{T}} X_{ij}\right)}, \qquad (2.1)$$
with $X_{ij} = (1, T_{ij}^{\mathrm{T}})^{\mathrm{T}}$ and $p_{ijM} = \{1 + \sum_{m=1}^{M-1} \exp(\beta_m^{\mathrm{T}} X_{ij})\}^{-1}$, so that class M serves as the reference category.
The parameter $\beta = (\beta_1^{\mathrm{T}}, \beta_2^{\mathrm{T}}, \ldots, \beta_{M-1}^{\mathrm{T}})^{\mathrm{T}}$ may be estimated by maximizing the log-likelihood function $\ell(\beta) = \sum_{i=1}^{M} \sum_{j=1}^{n_i} \log p_{iji}(\beta)$. The estimation is implemented in many statistical software packages. Substituting $\hat{\beta}$ for $\beta$ in (2.1) gives the estimated probabilities $\hat{p}_{ij} = (\hat{p}_{ij1}, \ldots, \hat{p}_{ijM})$, $i = 1, \ldots, M$, $j = 1, \ldots, n_i$. Of course, statistical procedures other than logistic regression may be employed, with the main requirement being that the procedure yields the estimated probabilities $\hat{p}_{ij}$. These M-dimensional vectors may then replace the $p_{ij}$ in the above estimator for HUM, giving
$$\widehat{\mathrm{HUM}} = \left(\prod_{i=1}^{M} n_i\right)^{-1} \sum_{j_1=1}^{n_1} \cdots \sum_{j_M=1}^{n_M} \mathrm{CR}\left(\hat{p}_{1j_1}, \hat{p}_{2j_2}, \ldots, \hat{p}_{Mj_M}\right). \qquad (2.2)$$
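As one concrete choice of fitting procedure, the sketch below obtains the $\hat{p}_{ij}$ with off-the-shelf multinomial logistic regression in scikit-learn; this is only an illustration, and any routine returning class probabilities could replace it. The large C is meant to make the default ridge penalty negligible, approximating the maximum likelihood fit of (2.1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimated_probs(T, y, M):
    """T: (n, q) matrix of test scores; y: class labels in {0, ..., M-1}.
    Recent scikit-learn versions use the multinomial formulation
    automatically with the lbfgs solver."""
    fit = LogisticRegression(C=1e6, max_iter=5000).fit(T, y)
    P = fit.predict_proba(T)   # columns ordered as classes 0, ..., M-1
    return P, [P[y == i] for i in range(M)]
```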

With M = 2, $\hat{p}_{ij1}$ is a monotone function of the risk score $r_{ij} = \hat{\beta}_1^{\mathrm{T}} X_{ij}$, and the estimator of HUM is equivalent to that from the 2-sample nonparametric Mann–Whitney statistic applied to $r_{1j}$, $j = 1, \ldots, n_1$, and $r_{2j}$, $j = 1, \ldots, n_2$ (Pepe, 2003). With a single test, one can show that estimation of β does not contribute variability to the estimation of HUM. That is, $\widehat{\mathrm{HUM}}$ from (2.2) and the nonparametric estimator of AUC are asymptotically equivalent, and the variance formula for known $p_{ij}$ is valid. For M > 2 and/or q > 1, the variance is influenced by $\hat{\beta}$.

It can be seen that $\widehat{\mathrm{HUM}}$ is an M-sample U-statistic evaluated at $\beta = \hat{\beta}$. One can show theoretically that $\widehat{\mathrm{HUM}}$ is consistent for HUM and asymptotically normal. However, because β is estimated, U-statistic variance formulae for $\widehat{\mathrm{HUM}}$ are not generally appropriate. We suggest bootstrap standard errors. For each bootstrap sample, the class probabilities are estimated and HUM is then estimated by plugging these estimated probabilities into (2.2). Denote these estimators by $\{\widehat{\mathrm{HUM}}_b : b = 1, 2, \ldots, B\}$. The bootstrap standard error for $\widehat{\mathrm{HUM}}$ is
$$\widehat{\mathrm{se}}\left(\widehat{\mathrm{HUM}}\right) = \left[\frac{1}{B-1} \sum_{b=1}^{B} \left\{\widehat{\mathrm{HUM}}_b - \frac{1}{B} \sum_{b'=1}^{B} \widehat{\mathrm{HUM}}_{b'}\right\}^2\right]^{1/2}. \qquad (2.3)$$
A 100(1 − α)% confidence interval for HUM is
$$\widehat{\mathrm{HUM}} \pm z_{\alpha/2}\, \widehat{\mathrm{se}}\left(\widehat{\mathrm{HUM}}\right), \qquad (2.4)$$
where $z_{\alpha/2}$ is the upper α/2 quantile of the standard normal distribution and the standard error estimator is obtained from (2.3). Alternatively, the interval between the α/2 and 1 − α/2 percentiles of the distribution of $\{\widehat{\mathrm{HUM}}_b : b = 1, 2, \ldots, B\}$ may be used.
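The following sketch implements one natural version of this bootstrap (resampling subjects within each class; resampling the pooled sample is an alternative), reusing the estimated_probs and hum_estimate functions sketched above:

```python
import numpy as np
from scipy.stats import norm

def bootstrap_hum(T, y, M, B=200, alpha=0.05, seed=0):
    """Stratified bootstrap: resample subjects within each class, refit the
    probability model, and recompute HUM, so that variability in estimating
    beta is propagated into the interval."""
    rng = np.random.default_rng(seed)
    hums = np.empty(B)
    for b in range(B):
        idx = np.concatenate([rng.choice(np.where(y == i)[0], size=np.sum(y == i),
                                         replace=True) for i in range(M)])
        _, by_class = estimated_probs(T[idx], y[idx], M)
        hums[b] = hum_estimate(by_class)
    _, by_class = estimated_probs(T, y, M)
    est = hum_estimate(by_class)
    se = hums.std(ddof=1)                                    # (2.3)
    z = norm.ppf(1 - alpha / 2)
    normal_ci = (est - z * se, est + z * se)                 # (2.4)
    pct_ci = tuple(np.quantile(hums, [alpha / 2, 1 - alpha / 2]))
    return est, se, normal_ci, pct_ci
```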

Nakas and Yiannoutsos (2004) proposed bootstrap inferences for a nonparametric estimator of HUM with ordered classes and a single test, without probability estimation. Our approach accommodates estimated class probabilities, as needed with multiple test results. An arbitrary estimation procedure may be used to convert multiple tests into scores for multiple unordered classes.

With small sample sizes, the intervals (2.4) may have poor coverage probabilities. The bias-corrected and accelerated (BCa) bootstrap confidence interval (Efron and Tibshirani, 1993) may give substantial improvements. The BCa interval runs from the $\alpha_1$ to the $\alpha_2$ empirical percentile of $\{\widehat{\mathrm{HUM}}_b : b = 1, 2, \ldots, B\}$, where
$$\alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z^{(\alpha/2)}}{1 - \hat{a}\left(\hat{z}_0 + z^{(\alpha/2)}\right)}\right), \qquad \alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z^{(1-\alpha/2)}}{1 - \hat{a}\left(\hat{z}_0 + z^{(1-\alpha/2)}\right)}\right),$$
$z^{(q)}$ denotes the q quantile of the standard normal distribution, and Φ(·) is the standard normal cumulative distribution function. The bias-correction term $\hat{z}_0$ is obtained directly from the proportion of the $\widehat{\mathrm{HUM}}_b$, $b = 1, 2, \ldots, B$, less than the original estimate; that is, $\hat{z}_0 = \Phi^{-1}\{\#(\widehat{\mathrm{HUM}}_b < \widehat{\mathrm{HUM}})/B\}$. The acceleration factor $\hat{a}$ can be computed as in Efron and Tibshirani (1993).
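A sketch of the BCa computation is below, with the acceleration factor obtained from the standard jackknife formula in Efron and Tibshirani (1993); the leave-one-out HUM estimates must be supplied by the caller:

```python
import numpy as np
from scipy.stats import norm

def bca_interval(est, boot, jack, alpha=0.05):
    """est: original HUM estimate; boot: B bootstrap estimates; jack: n
    leave-one-out estimates for the jackknife acceleration. Assumes the
    proportion of boot below est is strictly between 0 and 1."""
    z0 = norm.ppf(np.mean(boot < est))                       # bias correction
    d = jack.mean() - jack
    a = (d ** 3).sum() / (6.0 * ((d ** 2).sum()) ** 1.5)     # acceleration
    def endpoint(q):
        z = norm.ppf(q)
        return norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
    a1, a2 = endpoint(alpha / 2), endpoint(1 - alpha / 2)
    return tuple(np.quantile(boot, [a1, a2]))
```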

Under mild conditions, the bootstrap inferences for HUM defined by (2.3), (2.4), and BCa should be valid as long as the $\hat{p}_{ij}$ consistently estimate the $p_{ij}$. Difficulties may arise if the procedure used to estimate the $p_{ij}$'s is computationally expensive, so that its repeated application in bootstrapping may be prohibitive, particularly with large data sets. In the microarray analyses reported in Section 3, computing time is not an issue with realistic sample sizes.

2.3 Comparison of HUM and CCR

The estimation of HUM depends on the estimated $p_{ij}$. In the 2-class setting, when binary logistic regression is used, McIntosh and Pepe (2002) showed that the risk score $r_{ij}$ can yield optimal classification using thresholding rules. In practice, it is standard to assign a subject to class c if $\hat{p}_{ijc}$ is larger than $\hat{p}_{ijk}$ for all $k \neq c$, regardless of the prevalences. This assignment may then be compared with the subject's true status, and the proportion assigned correctly can be calculated as
$$\widehat{\mathrm{CCR}} = \left(\sum_{i=1}^{M} n_i\right)^{-1} \sum_{i=1}^{M} \sum_{j=1}^{n_i} I\left(\hat{p}_{iji} > \hat{p}_{ijk} \text{ for all } k \neq i\right), \qquad (2.5)$$
where I(·) is the indicator function.
The rate (2.5) is an alternative to HUM for quantifying the classification accuracy of diagnostic tests.
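In code, (2.5) and the class-specific rates defined next are one-liners (a sketch under the same conventions as above):

```python
import numpy as np

def ccr_estimate(P, y):
    """Overall CCR (2.5): assign each subject to the class with the largest
    estimated probability and count the proportion of correct assignments."""
    return np.mean(P.argmax(axis=1) == y)

def ccr_by_class(P, y, M):
    """Class-specific rates CCR_i, i = 1, ..., M (0-indexed here)."""
    pred = P.argmax(axis=1)
    return [np.mean(pred[y == i] == i) for i in range(M)]
```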

We now define CCR formally. Let the prevalences of the categories (Li and others, 2007) be $w_i$, $i = 1, 2, \ldots, M$, where $\sum_{i=1}^{M} w_i = 1$. Given the event $A_i = \{$the subject is in the ith class$\}$, the conditional probability of the event $B_i = \{$the decision rule assigns the subject to the ith class$\}$ is $\mathrm{CCR}_i = \Pr(B_i \mid A_i)$, which is the CCR for the ith class, $i = 1, \ldots, M$. The overall CCR is the prevalence-weighted average of the class-specific rates, $\mathrm{CCR} = \sum_{i=1}^{M} w_i\, \mathrm{CCR}_i$. Its interpretation is the probability of correctly classifying a subject randomly selected from the M classes.

Note that each CCRi is analogous to the so-called sensitivity, that is, the conditional probability that a test classifies a subject as diseased given the subject is diseased. When M =2, CCR1 and CCR2 are equivalent to sensitivity and specificity, respectively (see Zhou and others, 2002).

CCR depends heavily on the prevalences and may be very sensitive to changes in these values, especially when the $\mathrm{CCR}_i$'s are dissimilar. High prevalences for classes with large $\mathrm{CCR}_i$ may inflate the overall CCR, even though the $\mathrm{CCR}_i$'s for some classes are small, and vice versa. In contrast, HUM is not sensitive to the $w_i$'s. The implication is that populations with varying prevalences would have identical HUMs and $\mathrm{CCR}_i$'s but potentially very different CCRs. In Section 3, models selected using HUM exhibit superior empirical performance relative to those based on prevalence-dependent measures. In other applications, for example, public health interventions, where a model's utility depends explicitly on class prevalences, the model's ultimate impact may be better evaluated via CCR.

3. SRBCT CLASSIFICATION EXAMPLES

Gene expression is the process by which a gene's DNA sequence is converted into the functional proteins of the cell. DNA microarrays are commonly used for expression profiling, that is, monitoring expression levels of thousands of genes simultaneously, or comparative genomic hybridization. For each gene, a quantitative expression level can be obtained. If we collect gene expression data from 1000 genes for a group of patients, we then have 1000 test results for each patient. The aim is to parsimoniously combine such information to accurately predict tumor class.

We reanalyzed the tumor data from Khan and others (2001), in which cDNA gene expression profiles were used to classify SRBCTs into 4 distinct categories: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing tumor family (EWS). Preparation of glass cDNA microarrays, probe labeling, hybridization, and image acquisition were performed according to the National Human Genome Research Institute protocol (Khan and others, 2001). Image analysis was performed using DeArray software (Chen and others, 1997). The cDNA clones were obtained from Research Genetics (Huntsville, AL) and were their standard microarray set. The data set contains 1586 gene levels from 83 samples. It can be downloaded from http://www.nhgri.nih.gov/DIR/Microarray/Supplement.

Khan and others (2001) employed artificial neural networks in tumor classification, while Lee and others (2004) used support vector machines and Yeo and Poggio (2001) used nearest neighbor methods. These models are quite large. Our objective is to explore whether HUM with multinomial logistic regression might yield simpler rules with comparable performance.

In earlier analyses, the sample is split into a training set and a test set. The training set contains 12 NB (19.0%), 20 RMS (31.7%), 8 Burkitt lymphoma (BL, a subset of NHL, 12.7%), and 23 EWS (36.5%) samples. The test set comprises 6 NB (30%), 5 RMS (25%), 3 BL (15%), and 6 EWS (30%) samples.

We first fitted a simple multinomial logistic regression model to the training set, separately for each of the 1586 genes. We used the estimated probabilities to calculate HUM, as described in Section 2. The 25 gene expression levels with the highest HUM values are reported in Table 1, ranked from 1 to 25. We also computed the CCR using (2.5) (Table 1). Note that the values of HUM and CCR are generally quite different. In fact, the product moment correlation between HUM and CCR is −0.237 across all 1586 genes and 0.058 for the top 25 genes, so the 2 measures are virtually uncorrelated.
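The single-gene screen may be sketched as follows (hypothetical names; reusing the helper functions from Section 2). With 4 classes, the exact U-statistic sum has $\prod_i n_i = 44\,160$ terms per gene, so a vectorized or subsampled implementation is advisable when looping over all 1586 genes:

```python
def screen_genes(X, y, M):
    """X: (n, G) expression matrix; y: tumor labels in {0, ..., M-1}.
    Fits a one-gene multinomial logit per gene and returns (gene, HUM, CCR)
    triples sorted by HUM, largest first."""
    results = []
    for g in range(X.shape[1]):
        P, by_class = estimated_probs(X[:, [g]], y, M)
        results.append((g, hum_estimate(by_class), ccr_estimate(P, y)))
    return sorted(results, key=lambda r: -r[1])
```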

Table 1.

Top 25 gene expression levels ranked by HUM value. CCR is the corresponding overall correct classification rate. CCR[i] is the correct classification rate for the ith class (i = 1, 2, 3, 4). AUC[i:j] is the area under the curve for discriminating the ith and jth classes (i, j = 1, 2, 3, 4). Classes 1, 2, 3, and 4 are EWS, BL, NB, and RMS, respectively.

Rank  HUM    CCR    CCR[1]  CCR[2]  CCR[3]  CCR[4]  AUC[1:2]  AUC[1:3]  AUC[1:4]  AUC[2:3]  AUC[2:4]  AUC[3:4]
1     0.242  0.317  0       0       0       1.00    0.772     0.641     0.980     0.813     1.00      0.933
2     0.210  0.365  1.00    0       0       0       1.00      1.00      0.974     0.792     0.856     0.729
3     0.183  0.317  0       0       0       1.00    0.793     0.554     0.980     0.677     1.00      0.954
4     0.148  0.190  0       0       1.00    0       0.984     0.732     0.776     0.854     0.863     0.783
5     0.129  0.317  0       0       0       1.00    0.777     0.464     0.874     0.823     0.950     0.883
6     0.119  0.127  0       1.00    0       0       1.00      0.743     0.537     1.00      1.00      0.733
7     0.116  0.317  0       0       0       1.00    0.891     0.924     0.985     1.00      1.00      0.754
8     0.110  0.127  0       1.00    0       0       0.940     0.826     0.539     1.00      0.95      0.800
9     0.108  0.190  0       0       1.00    0       0.706     0.920     0.520     0.989     0.706     0.933
10    0.095  0.127  0       1.00    0       0       0.973     0.717     0.402     1.00      0.963     0.846
11    0.087  0.127  0       1.00    0       0       0.989     0.833     0.446     1.00      1.00      0.892
12    0.085  0.317  0       0       0       1.00    0.696     0.609     0.787     0.667     0.913     0.858
13    0.079  0.317  0       0       0       1.00    0.522     0.522     0.821     0.521     0.844     0.825
14    0.078  0.365  1.00    0       0       0       1.00      0.949     0.735     0.656     0.938     0.867
15    0.077  0.317  0       0       0       1.00    0.359     0.322     0.998     0.500     0.981     0.996
16    0.076  0.317  0       0       0       1.00    0.332     0.547     0.985     0.281     0.975     0.992
17    0.074  0.317  0       0       0       1.00    0.734     0.369     0.946     0.823     1.00      0.892
18    0.072  0.190  0       0       1.00    0       0.630     1.00      0.611     0.989     0.688     0.971
19    0.071  0.317  0       0       0       1.00    0.793     0.438     0.947     0.771     0.994     0.888
20    0.069  0.190  0       0       1.00    0       0.842     0.996     0.446     1.00      0.894     0.996
21    0.067  0.317  0       0       0       1.00    0.717     0.623     0.919     0.583     0.988     0.975
22    0.063  0.317  0       0       0       1.00    0.783     0.768     0.928     0.917     0.950     0.650
23    0.062  0.317  0       0       0       1.00    0.788     0.482     0.859     0.739     0.931     0.846
24    0.062  0.317  0       0       0       1.00    0.793     0.565     0.891     0.771     0.963     0.896
25    0.061  0.317  0       0       0       1.00    0.875     0.743     0.915     0.958     0.569     0.975

The CCRs for individual classes were also computed (see Table 1). Interestingly, each gene is a perfect classifier for a single tumor type but fails to correctly classify the other 3 tumor types. This phenomenon holds for all 1586 genes and explains why the overall CCR for each gene equals the sample prevalence of the tumor type that the gene classifies correctly. In this example, $\widehat{\mathrm{CCR}}$ provides almost no information about the diagnostic accuracy of single genes; it simply reproduces the prevalence of the tumor type for which the gene is a perfect classifier.

Following Obuchowski and others (2001), pairwise AUCs were estimated by conditioning the multinomial logistic regression model on the 2 classes under consideration and choosing the class with the largest estimated conditional probability. As mentioned in Section 2.2, the AUC estimates from (2.1) with M =2 are asymptotically equivalent to nonparametric AUC estimates, as in pairwise analyses. The AUCs are given in Table 1 for the top 25 genes. Observe that when a gene has a high CCR for a certain class, its AUC for this class versus any other class is high. Genes with high HUMs tend to have high pairwise AUCs. Thus, one may view HUM as a useful summary of the pairwise information.
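A sketch of the pairwise computation (conditioning the estimated probabilities on the 2 classes and applying the Mann–Whitney statistic) is:

```python
import numpy as np

def pairwise_auc(P, y, i, j):
    """AUC for discriminating classes i and j, using the estimated
    probabilities conditioned on the pair, P[:, i] / (P[:, i] + P[:, j])."""
    keep = (y == i) | (y == j)
    s = P[keep, i] / (P[keep, i] + P[keep, j])
    si, sj = s[y[keep] == i], s[y[keep] == j]
    # Mann-Whitney estimate with the usual 1/2 contribution for ties
    comp = (si[:, None] > sj[None, :]) + 0.5 * (si[:, None] == sj[None, :])
    return comp.mean()
```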

Recall from Section 2.1 that with 4 classes an HUM of 0.04 corresponds to noninformative tests. The top 2 ranked genes have HUMs above 0.2, the genes ranked 3–9 have HUMs from 0.1 to 0.2, and the genes ranked 10–25 have HUMs from 0.06 to 0.1. These findings show that the HUMs are well above the noninformative level and that there are substantial differences among the top 25 genes, with a few HUM standouts.

We selected the top 10 genes as covariates to fit a multiple multinomial logistic regression with the training set. The estimated HUM on the training set is 1. When we used the estimated probabilities from this model to diagnose the tumor types by assigning each sample to the tumor type with the highest estimated probability, we achieved 100% CCR for the training set. We applied the fitted model to the test set and calculated the corresponding probability values for the 20 samples in this set. The CCR for the test set is 20/20 = 100%.

We then used forward selection on the training set with the top 10 genes to build a more parsimonious model with equivalent classification performance. We started with the model containing the top-ranked gene according to HUM and added the gene from the remaining 9 that maximized HUM. Keeping genes from previous iterations, we repeatedly maximized HUM over the remaining genes; the greedy search is sketched below. The resulting models from the first 4 steps of the selection procedure are shown in Table 2, along with the estimated HUM and CCR on the training set. One sees a marked increase in both HUM and CCR as the number of genes increases. With 4 genes, both HUM and CCR equal 1; the model is a perfect classifier on the training set. The CCR for this model on the test samples is 19/20 = 95%, suggesting good generalizability. Adding the gene ranked 8 in Table 1 yields perfect performance on the test set.
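The sketch below (hypothetical helper names from the earlier sketches; candidates holds the gene column indices ordered by single-gene HUM) illustrates the greedy search:

```python
def forward_select(X, y, M, candidates, n_steps=4):
    """Greedily adds, from the remaining pool, the gene maximizing the
    training-set HUM of the joint multinomial logit model."""
    selected, pool = [candidates[0]], list(candidates[1:])
    for _ in range(n_steps - 1):
        def joint_hum(g):
            _, by_class = estimated_probs(X[:, selected + [g]], y, M)
            return hum_estimate(by_class)
        best = max(pool, key=joint_hum)
        selected.append(best)
        pool.remove(best)
    return selected
```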

Table 2.

Results of forward selection using the top 10 genes ranked according to HUM.

Number of genes  Gene ranks  HUM (95% confidence interval)  CCR (%)
1                1           0.242 (0.059, 0.425)           31.7
2                1, 6        0.779 (0.653, 0.905)           74.6
3                1, 6, 9     0.922 (0.831, 1.000)           90.5
4                1, 2, 6, 9  1.000 (1.000, 1.000)           100

The genes in the final model have ranks 1, 2, 6, and 9 according to HUM. Inspecting their $\mathrm{CCR}_i$'s, each gene is a perfect classifier for a different tumor type: gene 1 for RMS, gene 2 for EWS, gene 6 for BL, and gene 9 for NB. This raises interesting questions. Can performance similar to that of the final model be achieved by randomly selecting 4 genes which are perfect classifiers for the 4 classes? Or, is the HUM of the individual genes strongly linked to their performance in the 4-gene model?

We randomly chose 4 genes from the genes ranked 10–100, 101–1000, and 1001–1586 according to their individual HUM values, where each of the 4 genes correctly identifies a different tumor type. We then used a model with these 4 genes to predict tumor type. For the 4 genes ranked 10–100, the model has estimated HUM 0.732 (0.557, 0.907) and CCR 0.492 on the training set. For the genes ranked 101–1000, the HUM is 0.609 (0.383, 0.836) and the CCR is 0.429. Finally, for the genes ranked 1001–1586, the HUM is 0.355 (0.097, 0.613) and the CCR is 0.365. In this experiment, the HUMs of the individual genes are critically important to the performance of the classification schemes.

One might also be interested in similar models using the pairwise AUCs. For example, we can choose genes 11, 14, 15, and 18 in Table 1, since each of these 4 genes correctly identifies a different tumor type. For gene 11, the mean of the relevant pairwise AUCs, (AUC[1:2] + AUC[2:3] + AUC[2:4])/3, equals 0.996, since this gene correctly identifies class 2. Similarly, for genes 14, 15, and 18, the means of the relevant pairwise AUCs are 0.895, 0.988, and 0.987, respectively. These average pairwise AUC values are as good as those for genes 1, 2, 6, and 9. Nevertheless, the CCR on the training set for these 4 genes jointly is only 0.651. Such a model is much less accurate than the model using genes 1, 2, 6, and 9, which all possess high HUM values. Hence, pairwise AUC is not as useful as HUM in this example.

The model we obtained by first ranking genes using HUM and then employing forward selection is much more parsimonious than those in previous analyses. Lee and others (2004) used support vector machines requiring more than 20 genes to correctly classify all test samples. Khan and others (2001) employed neural networks in which more than 96 genes were needed to achieve perfect classification performance. Given the similar classification accuracies of the different procedures, our approach based on multinomial logistic regression would appear to be attractive for practitioners, owing to its simplicity and ease of interpretation.

4. DISCUSSION

Although the multidimensional ROC framework and corresponding HUM were originally introduced in Scurfield (1996, 1998), their practical utility in empirical analysis was not investigated. Mossman (1999) stimulated statistical work attempting to translate the theoretical HUM construct into practically useful inferences. To our knowledge, there has not been a wholly satisfactory solution for resolving difficulties with multiple tests. It is usually not possible to obtain direct probability assessments from such tests. Simple decision rules, like those in Nakas and Yiannoutsos (2004), are not flexible enough for many applications, like microarray data, where there are many tests and unordered categories.

Our method overcomes this problem by using estimated class probabilities. While the simple multinomial logistic regression model (2.1) seemed to give satisfactory results in the microarray analyses, any classification method generating such probability estimates could have been used.

The data analysis in Section 3 demonstrated that ranking genes using HUM instead of CCR may be preferable for determining inputs to classification procedures (Dudoit and others, 2002). This was evident in the SRBCT analysis, where HUM and CCR were virtually uncorrelated across genes. The resulting models based on HUM were quite parsimonious compared to those from earlier analyses (Khan and others, 2001; Lee and others, 2004). The CCR provided almost no information about individual genes, and models based on CCR were clearly inferior to those based on HUM.

FUNDING

National Cancer Institute, National Institutes of Health.

Conflict of Interest: None declared.

References

Breiman, L., Friedman, J. H., Olshen, R. and Stone, C. J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

Chen, Y., Dougherty, E. R. and Bittner, M. L. (1997). Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics 2, 364-374.

Dreiseitl, S., Ohno-Machado, L. and Binder, M. (2000). Comparing three-class diagnostic tests by three-way ROC analysis. Medical Decision Making 20, 323-331.

Dudoit, S., Fridlyand, J. and Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97, 77-87.

Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179-188.

Fix, E. and Hodges, J. (1951). Discriminatory analysis, nonparametric discrimination: consistency properties. Technical Report. Randolph Field, TX: US Air Force, School of Aviation Medicine.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics 19, 293-325.

Khan, J., Wei, J., Ringnér, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C. and others (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673-679.

Lee, Y., Lin, Y. and Wahba, G. (2004). Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67-81.

Li, J., Fine, J. P. and Safdar, N. (2007). Prevalence dependent diagnostic accuracy measures. Statistics in Medicine 26, 3258-3273.

McIntosh, M. W. and Pepe, M. S. (2002). Combining several screening tests: optimality of the risk score. Biometrics 58, 657-664.

Mossman, D. (1999). Three-way ROCs. Medical Decision Making 19, 78-89.

Nakas, C. T. and Yiannoutsos, C. T. (2004). Ordered multiple-class ROC analysis with continuous measurements. Statistics in Medicine 23, 3437-3449.

Obuchowski, N. A., Applegate, K. E., Goske, M. J., Arheart, K. L., Myers, M. T. and Morrison, S. (2001). The differential diagnosis for multiple diseases: comparison with the binary-truth state experiment in two empirical studies. Academic Radiology 8, 947-954.

Pepe, M. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press.

Scurfield, B. K. (1996). Multiple-event forced-choice tasks in the theory of signal detectability. Journal of Mathematical Psychology 40, 253-269.

Scurfield, B. K. (1998). Generalization of the theory of signal detectability to n-event m-dimensional forced-choice tasks. Journal of Mathematical Psychology 42, 5-31.

Yeo, G. and Poggio, T. (2001). Multiclass classification of SRBCTs. Technical Report AI Memo 2001-018, CBCL Memo 206. Cambridge, MA: MIT.

Zhou, X.-H., Obuchowski, N. A. and McClish, D. K. (2002). Statistical Methods in Diagnostic Medicine. New York: Wiley.