## Abstract

**Motivation:** One important aspect of data-mining of microarray data is to discover the molecular variation among cancers. In microarray studies, the number *n* of samples is relatively small compared to the number *p* of genes per sample (usually in thousands). It is known that standard statistical methods in classification are efficient (i.e. in the present case, yield successful classifiers) particularly when *n* is (far) larger than *p*. This naturally calls for the use of a dimension reduction procedure together with the classification one.

**Results:** In this paper, the question of classification in such a high-dimensional setting is addressed. We view the classification problem as a regression one with few observations and many predictor variables. We propose a new method combining partial least squares (PLS) and Ridge penalized logistic regression. We review the existing methods based on PLS and/or penalized likelihood techniques, outline their interest in some cases and theoretically explain their sometimes poor behavior. Our procedure is compared with these other classifiers. The predictive performance of the resulting classification rule is illustrated on three data sets: Leukemia, Colon and Prostate.

**Availability:** Software that implements the procedures and data source on which this paper focuses are freely available at http://www-lmc.imag.fr/SMS/membres/Gersende_Fort,Sophie_Lambert.html

**Contact:**sophie.lambert@imag.fr

## INTRODUCTION

Microarray technology generates a vast amount of data by measuring, through the hybridization process, the levels of virtually all the genes expressed in a biological sample. One can expect that knowledge gleaned from microarray data will contribute significantly to advances in fundamental questions in biology as well as in clinical medicine. One important goal of analyzing microarray data is to classify the samples. To cite a few, Golub *et al*. (1999) have considered classification of acute leukemia and Alon *et al*. (1999) have addressed the cluster analysis of tumor and normal colon tissues. The approaches developed in these papers consist in discrimination methods and machine learning methods [see Dudoit *et al*. (2002) for a comparative study].

In microarray studies, the number of samples, *n*, is relatively small compared to the number of genes, *p*, usually in thousands. Unless a preliminary variable selection step is performed, standard statistical methods in classification perform poorly because there are far more variables than observations. One problem is multicollinearity: estimating equations become singular and have no unique and stable solution. For instance, the pooled within-class sample covariance matrix in Fisher's linear discriminant function is singular if *n* < *p* + 2. Even if all genes can be used as in support vector machines, it seems to be not sensible to use all the genes. Indeed, this use allows the presence of the noise associated with genes of little or no discrimination power. That inhibits and degrades the performances of the classification rules in their application to unclassified tumor. In this situation, dimension reduction is needed to reduce the high *p*-dimensional gene space. In most previously mentioned works, the authors have used univariate methods for reducing the number of genes. Alternative approaches to handle the dimension reduction problem can also be used (see for instance Ghosh, 2002; Nguyen and Rocke, 2002; West *et al*., 2001; Antoniadis *et al*., 2003).

Similar data structures have been encountered in the field of chemometrics. The method of partial least squares (PLS) (Wold, 1975; Naes and Martens, 1985; Helland, 1988) has been found to be a useful dimension reduction technique as well as principal component regression (PCR) [Massy, 1965; see Frank and Friedman (1993) for a statistical view of PLS and PCR]. In the context of microarrays, the purpose of PCR is to produce orthogonal tumor descriptors that reduce the dimension to only a few gene components (super-genes) (West *et al*., 2001). But the dimension reduction is achieved without regard to the response variable and may be inefficient. This is the reason why PLS looks more adapted than PCR for dimension reduction based prediction. Indeed, PLS components are chosen so that the sample covariance between the response and a linear combination of the *p* predictors (genes) is maximum.

Nguyen and Rocke (2002) proposed using PLS for dimension reduction as a preliminary step to classification, based either on linear logistic discrimination, or linear or quadratic discriminant analysis. However, this seems to be intuitively unappealing because PLS is really designed to handle continuous responses and models that do not suffer from heteroscedasticity as it is the case for Bernoulli or multinomial data. Furthermore, in practice we have observed problems in the convergence of the iteratively reweighted least squares (IRLS) algorithm, which is the usual procedure for solving the maximum likelihood (ML) equation in the field of the generalized linear models (GLM). Indeed, for logistic regression, it is well known that convergence poses a long standing problem. Infinite parameter estimates can occur depending on the configuration of the sample points in the observation space (Albert and Anderson, 1984).

Marx (1996) proposed an extension of PLS to a categorical response variable and illustrated the developments from a spectroscopy example. His approach embedded the usual PLS steps within the IRLS. Unfortunately, we have observed that this algorithm does not converge in many cases of interest (such as in the applications considered in this paper). More recently, Ding and Gentleman (2004, http://www.bepress.com/bioconductor/) proposed an approach based on this procedure. They phrased the problem in a GLM setting and applied Firth's procedure to avoid (quasi)separation.

To deal with the high-dimension problem, another approach consists in penalizing the likelihood. Eilers *et al*. (2001) propose useing the Ridge penalized logistic regression in order to both stabilize the statistical problem and remove numerical degeneracy due to multicollinearity. They have shown that this method appears to work well with microarray data. Note that this method is not a dimension-reduction technique. Indeed all explanatory variables are allowed into the regression model. From the log-likelihood a so-called Ridge penalty is subtracted. All the genes contribute, which can inhibit and degrade the performances of the classification rules. Note that we can find alternative approaches (see, for example, Huang and Pan, 2003; Ghosh, 2003) for which the classification problem is not viewed as a problem in a logistic regression.

In this paper, we extend the PLS method to a binary response variable. To do that, we want to substitute the categorical response variable in the input of PLS by a continuous-valued pseudo-response variable whose expected value has a linear relationship with the covariates. The limiting pseudo-response variable in the IRLS algorithm seems to be a good candidate. Unfortunately, in the present situation, ‘small *n*, large *p*’, IRLS no longer works since the limiting pseudo-response variable is, in norm, infinite. The idea developed here is to penalize with a Ridge penalty the likelihood criterion in order to constrain the pseudo-response variable to be finite. That is, our procedure combines a Ridge penalty step and a PLS step and the dimension-reduction step is incorporated in the classification step. Here we present the classification rule for the binary response variable indicating a normal or colon tumor, for instance. Nevertheless, our approach remains valid for multi-categorical response variables. But the binary case is the simplest case which allows us to point out whether such a procedure works well or not and why.

This paper is organized as follows. The Methods section is the methodological part of this paper. It contains a description of the logistic regression and linear discrimination. We then recall the Ridge regression method and derive a weighted PLS algorithm in order to address the dimension reduction in heteroscedastic models. We then introduce an extension of PLS to GLM based on the Ridge penalty, and analyze the Nguyen and Rocke, Marx, Ding and Gentleman and Eilers *et al*. algorithms. Applications to disease classification through microarray are presented in the Results section.

## METHODS

### Some basic ingredients

After introducing some notations, we recall the principle of linear logistic discrimination, some results on the existence of the maximum likelihood estimator and the classical algorithm used to compute it. Next, we present a regularization method, a penalized maximum likelihood method, and a dimension-reduction technique, PLS.

### Notations

Expression levels of the *p* genes for the *n* microarray samples are collected in an *n* × *p data matrix X* = (*x*_{i/j}), 1 ≤ *i* ≤ *n*, 1 ≤ *j* ≤ *p*. The entry *x*_{i/j} is the expression level of the variable ‘gene’ *j* in the microarray sample *i*, and the *i*-th row *X*_{i,·} is the vector of a gene expression profile for sample *i*. More generally, for a matrix *A*, *A*_{i,j} denotes the entry (*i*, *j*), *A*_{·,j} (resp. *A*_{i,·}) denotes the column vector collecting the column #*j* (resp. the row #*i*). *A*_{i1:i2,j1:j2} is the (*i*_{2} − *i*_{1} + 1) × (*j*_{2} − *j*_{1} + 1) matrix formed by picking out the rows *i*_{1} to *i*_{2} and columns *j*_{1} to *j*_{2} of *A*; *A*_{·,j1:j2} is formed by picking out the columns *j*_{1} to *j*_{2} of *A*. The labels of the *n* microarray samples are collected in a {0,…, (*g* − 1)}^{n}-valued vector **y**. In supervised machine learning, each sample is thought to originate from a specific class *k* ∈ {0,…,*g* − 1} where the number of possible classes *g* is known and fixed. A classifier can be regarded as a function *G*: ℝ^{p} → {0,…,g−1} that predicts the unknown class label of a new tissue sample *x* ∈ℝ^{p} by *G*(*x*). We assume that the data (**y**, *X*) collect observations of *n* statistically independent and identically distributed random pairs (*Y*, ** X**). We choose a logit model for the data (see, e.g. Fahrmeir and Tutz, 2001), and the logistic discrimination (LD) method for the classification procedure (see, e.g. Timm, 2002). In the terminology of the regression analysis, (

*X*

_{·,j})

_{1≤j≤p}are the predictor variables and (y

_{i})

_{1≤i≤n}the response variables. We include an intercept into the regression model, and denote by

*Z*= [𝕀

_{n}

*X*] the

*design matrix*of size

*n*× (

*p*+ 1), where 𝕀

_{n}= (1,…,1)′ stands for the column vector of length

*n*(′ denotes the transposition operator).

### Linear LD

In logit models, the conditional class probability—or equivalently, the conditional expectation of *Y* given **X**—P(*Y* = 1|**X** = x;γ) is related to *x* and some parameter γ ∈ ℝ^{p+1} through the relation P(*Y* = 1| **X** = x;γ) = *h*([1 *x*′]γ) where *h*(η) = 1/[1 + exp(−η)]. The quantity [1 *x*′] γ is called the linear predictor. γ is an unknown parameter that has to be estimated from the data. In LD, it is usually estimated by

*l*(γ), is given by

*i*≤

*n*, η

_{i}(γ) = (

*Z*γ)

_{i}.

For a vector *z* = [1 *x*′], the predicted class

*n*≪

*p*, the existence and unicity of

### Maximum likelihood estimate and IRLS

We say that the ML estimate exists if there exists γ ∈ ℝ^{p+1} of finite norm which is a maximizer of the concave log-likelihood *l*. Hence, such an estimate is a solution to the normal equation *Z*′(**y** − π(γ)) = 0, where π(γ) is the ℝ^{n}-valued mean vector with coordinates π_{i}(γ) = *h*(η_{i}(γ)).

If *Z* is full column-rank, the solution, when exists, is unique. The existence of a solution, when *Z* is full column-rank, depends on the configuration of the *n* sample points in the observation space ℝ^{p} (Albert and Anderson, 1984; Santner and Duffy, 1986). There are three exclusive situations: separate, quasi-separate and overlap. In the first two cases, there exists

*i*such that

**y**

_{i}= 1 and

*i*such that

**y**

_{i}= 0; roughly speaking, this means that there exists a hyperplane that exactly separates the two classes, except maybe some points that can belong to the hyperplane. In such a case,

*l*reaches its maximum as ‖γ‖ tends to +∞ and the ML estimate does not exist. In the third case, the estimate exists and is computed as the limit of a converging Newton–Raphson sequence; this algorithm is known as the IRLS algorithm (Green, 1984). Let

*W*(γ) be the diagonal

*n*×

*n*matrix with diagonal entries

*W*

_{i,i}(γ) = π

_{i}(γ)[1 − π

_{i}(γ)]. Each iteration divides into two steps,

*W*

^{(t)}and π

^{(t)}are shorthand notations for

*W*(γ

^{(t)}) and π(γ

^{(t)}). IRLS can thus be considered as an iterative weighted least square regression of an ℝ

^{n}-valued pseudo-variable

*z*

^{(t)}onto the columns of

*Z*. We denote this algorithm by IRLS (

*y*,

*x*)

When *Z* is not full column-rank, the parameter is not identifiable and the ML estimate is not unique when exists; applying the above iterations [Equations (2) and (3)] by replacing the inverse matrix (3) with the Moore–Penrose pseudo-inverse, yields the parameter estimate which is of minimal norm among all the solutions. In practice, in the present statistical framework *n* ≤ *p*, *n* = *rank*(*Z*) and the minimal norm solution verifies for all 1 ≤ *i* ≤ *n*, (*Z*γ)_{i} = ln(**y**_{i}) − ln(1 − **y**_{i}); it is thus of infinite norm and the ML estimate cannot exist. This calls for regularization methods.

### Ridge penalty and RIRLS

The Ridge estimator (Le Cessie and Van Houwelingen, 1992

*l*

^{*}(γ) =

*l*(γ) − 0.5λγ′Σ

^{2}γ, where λ > 0 is the shrinkage parameter, and Σ

^{2}is a diagonal matrix with entries

*X*, and does not apply to the location parameter γ

_{1}.

**y**,

*X*, λ) (shorthand notation for Ridge-IRLS) this algorithm. It consists in replacing in IRLS, the weighted regression (3) by a weighted Ridge regression γ

^{(t+1)}= (

*Z*′

*W*

^{(t)}

*Z*+ λΣ

^{2})

^{−1}

*Z*′

*W*

^{(t)}

*z*

^{(t)}, where

*z*

^{(t)}is built as in Equation (2).

λ controls the amount of shrinkage in the data and can be chosen as the minimum, over a given range, of the BIC criterion −2

### Weighted PLS

PLS is both a tool for linear regression and a tool for dimension reduction (Wold, 1975Naes and Martens, 1985Helland, 1988). Let **y** ∈ ℝ^{n} be a response vector, *X* be an *n* × *p* data matrix and *W* be a symmetric positive definite *n* × *n* matrix. PLS (i) defines κ *W*-orthogonal *scores* (*t*_{k})_{1≤k≤κ}, linear combinations of the columns of *Z* such that for all *k*,

*W*-weighted least squares regression of

**y**on (𝕀

_{n},

*t*

_{1},…,

*t*

_{κ}). This yields the decomposition

*f*

_{κ+1}is

*W*-orthogonal to the vectors (𝕀

_{n},

*t*

_{1},…,

*t*

_{κ}). Contrary to classical dimension-reduction methods (such as PCR), the scores depend on the response vector

**y**; roughly speaking, given (

*t*

_{k})

_{1≤k≤l},

*t*

_{l+1}is the linear combination of the columns of

*Z*, i.e. is of the form

*t*

_{l+1}=

*Zc*, which is the most informative on the residual response variable

*f*

_{l+1}, when information is defined in terms of the weighted covariance |

*Cov*(

*Zc*,

*f*

_{l+1})|(

*W*) (Helland, 1988). While the maximal number of PLS scores κ

_{max}can be lower than rank(

*X*), in practice, it is often equal to rank(

*X*). Helland (1988) shows that the weighted PLS (WPLS) regression applied with κ = κ

_{max}is nothing more than the weighted least squares regression. In the literature, PLS is usually derived with

*W*=

*I*, the identity matrix; we thus detail the algorithm in the weighted case. Let

*p*×

*p*positive-definite diagonal matrix with diagonal entries Σ

_{j,j},

*j*≥ 2, given by Equation (4).Hereafter, this procedure is denoted by WPLS (

**y**,

*X*,

*W*, κ). If

*Z*is full column-rank, this algorithm determines a unique estimate

*Z*is not full column-rank, the procedure above yields the minimal norm vector among all the vectors verifying

**y**−

*f*

_{k+1}=

*Z*γ.

- \({X}^{s}=X{\tilde{\Sigma }}^{-1}\),
*t*_{0}= 𝕀_{n},*E*_{0}=*X*^{s};*f*_{0}=**y**. - \[\begin{array}{c}{q}_{k}={t}_{k}^{\prime }W{f}_{k}/\left({t}_{k}^{\prime }W{t}_{k}\right),\hbox{ \hspace{1em} }{f}_{k+1}={f}_{k}-{q}_{k}{t}_{k},\\ {E}_{k+1}={E}_{k}-{t}_{k}{t}_{k}^{\prime }W{E}_{k}/\left({t}_{k}^{\prime }W{t}_{k}\right),\\ {t}_{k+1}={E}_{k+1}{E}_{k+1}^{\prime }W{f}_{k+1}.\end{array}\]

### Ridge PLS

A direct application of PLS to GLM seems to be intuitively unappealing because PLS handles continuous responses. This is the reason why, in order to extend PLS to GLM, we want to replace the binary vector **y** with a pseudo-response variable whose expected value has a linear relationship with the covariates. The pseudo-response variable *z*^{∞} at the convergence of RIRLS(**y**, *X*, λ) verifies this condition and is thus our candidate: it is of the form

*W*

^{∞})

^{−1}. The main advantage of choosing

*z*

^{∞}instead of, for example, the pseudo-variable at the convergence of IRLS—which has a linear structure too—is that this allows the combination of a regularization step and of a dimension-reduction step. In addition, this extension is always well defined: recall indeed that in some cases (including the case

*n*≤

*p*), the ML estimate does not exist so that the pseudo-variable ‘at convergence’ of IRLS is of infinite norm.

As a consequence, we propose a new procedure which combines Ridge penalty (the regularization step) and PLS (the dimension-reduction step) the so-called Ridge PLS (RPLS). Let λ be some positive real constant and κ be some positive integer. RPLS divides in two steps:A detailed implementation is given in the Appendix. The first step builds a continuous response variable *z*^{∞} for the input of PLS, the ‘dispersion matrix’ of which is [*W*^{∞}]^{−1}. This explains the call, in the second step, to a weighted PLS procedure with weight *W*^{∞}. The use of *X*^{s} in WPLS and of Σ in the penalized ridge criterion makes our procedure invariant to the scaling of the data matrix.

(

*z*^{∞},*W*^{∞}) ←*RIRLS*(**y**,*X*, λ);- \({\widehat{\gamma }}^{\hbox{ PLS },\kappa }\leftarrow \hbox{ WPLS }({z}^{\infty },X,{W}^{\infty },\kappa ).\).

RPLS depends on two parameters, λ and κ. λ is determined at the end of Step 1, as minimizing the BIC criterion (see the Ridge penalty section), and thus independently of κ. In the linear regression setting, the optimal choice of κ when dimension reduction is achieved by PLS, is to our best knowledge, an open problem: the non-linear dependence of

*f*

_{κ+1}impossible. Finally, observe that RPLS provides an estimate

**y**,

*X*, λ and κ).

We are now able to provide an answer to the classification problem in a high-dimensional setting: our classification procedure consists in applying LD with the estimate

### Comparison with other approaches

We briefly review some regression procedures that use PLS as the dimension-reduction tool to manage the high-dimensional setting. We outline their interest and in some cases, explain their poor behavior.

#### Nguyen and Rocke's approach

Nguyen and Rocke (2002) substitute the data matrix *X* by an *n* × κ matrix

**y**,

*X*,

*I*,

*f*). Then they estimate the parameter in the ML sense by running IRLS (

**y**,

**y**is unappealing; in addition, the PLS-regression step does not take into account the heteroscedasticity of the response vector

**y**; finally, in many applications,

In practice, IRLS is stopped after a maximal number of iterations *n*_{max} thus hiding the non-convergence of IRLS. Unfortunately, the estimate

*n*

_{max}and this yields an unstable procedure for classification. We observed this phenomenon on the Leukemia data set.

*et al*., 1999); classification is performed on the samples from the test set. When

*p*= 150 and κ = 3, there are 1 (resp. 2) samples incorrectly classified if

*n*

_{max}= 7 (resp.

*n*

_{max}= 10).

#### Marx's approach

In Marx (1996) the parameter γ is estimated in the ML sense and is obtained at the convergence of IRLS(**y**,

*z*

^{(t)},

*X*,

*W*

^{(t)}, rank(

*E*

_{1})).

As recalled above, WPLS applied with the maximal number of PLS components is nothing else than weighted least squares (note that Marx chooses κ = rank(*E*_{1}) while in theory, κ_{max} should be strictly lower than rank(*E*_{1})). Hence IRPLS and IRLS coincide, and, when *X* is full row-rank (which is most often the case when *n* ≤ *p*), IRPLS never converges. In practice, IRPLS is stopped after a fixed number of iterations, thus hiding the non-convergence phenomenon. In addition, initializing IRPLS by choosing a linear predictor of the form η^{(0)} = *c*_{0}**y** − *c*_{0}(𝕀_{n} − **y**) [where for example *c*_{0} = ln(3)], as done by Marx, yields

*t*≥ 0,

*z*

^{(t)}= 2

*c*

_{t}

**y**−

*c*

_{t}𝕀

_{n}with

*c*

_{t}= 1 +

*c*

_{t−1}+ exp(−

*c*

_{t−1}), and

*W*

^{(t)}is proportional to the identity matrix I

_{n}. Since WPLS(

**y**,

*X*,

*W*, κ) = WPLS(α

**y**+ β 𝕀

_{n},

*X*,

*W*, κ)—in terms of the exhibited scores—for all α, β ∈ ℝ, WPLS (

*z*

^{(t)},

*X*,

*W*

^{(t)}, κ) returns the same scores as WPLS (

**y**,

*X*, I

_{n}, κ), thus proving

#### Ding and Gentleman's approach

The originality of their work (Ding and Gentleman, 2004) is that it simultaneously answers to the regularization question and to the dimension-reduction one. They run an approximation of a Newton–Raphson (NR) algorithm for solving a Firth's penalized ML criterion. As in IRLS, any iteration of the NR algorithm is a weighted least squares regression and Ding and Gentleman replace this least square regression by a WPLS one. We call this algorithm FPLS.

We run their method on the data sets described in the next section. On the colon data set and on the prostate data set, the algorithm does not always converge; we observe a cyclic behavior—after a burn-in period the path is periodic. The estimate

This approach is greatly promising since it addresses both the regularization and the dimension-reduction problems. Comparisons of our results with their approach are of interest and will be explored in future research.

#### Eilers *et al*.'s approach

Their method (Eilers *et al*., 2001) does not use PLS. We nevertheless mention their work since their estimate,

^{2}= I). The method of Eilers

*et al*. does not reduce the dimension but only deals with the regularization question. In particular, all the explanatory variables are allowed and included into the regression model, which can deteriorate the performances of the classifier. In the next section, we will outline the high interest of combining a reduction step with the Ridge regularization.

## RESULTS

We illustrate the interest of RPLS by considering applications for the classification of microarrays data. We compare the classification results from our procedure with those of other classifiers including RIRLS, FPLS, the effective dimension reduction (MAVE, Antoniadis *et al*., 2003), diagonal linear discriminant analysis (DLDA), diagonal quadratic discriminant analysis (DQDA) and *k*-nearest neighbors (KNN) based on the Euclidean distance [see Devroye *et al*. (1996) for an overview of the last three methods].

DLDA, DQDA and KNN are thus introduced in the present paper as ‘classical statistical methods’. As commented in the abstract, our goal is to show that these methods poorly behave when applied to high-dimensional data sets. This is exactly what happens, thus stressing the need for interest in methods based on regularization and dimension reduction.

In order to illustrate the interest of PLS over PCR in the regression framework, we compare our algorithm RPLS to ‘RPCR’ (for Ridge-PCR). By nature, PCR handles continuous responses; this calls for an extension of PCR to GLM, in order to use it as a dimension reduction in GLM. The extension we derived for PLS remains valid for PCR: we exhibit the continuous-valued pseudo-response variable at the convergence of the RIRLS algorithm and use this variable as the input variable for PCR. This yields RPCR.

### Data, pre-processing and gene selection

We will consider in turn the Leukemia, Colon and Prostate data sets.1 The Leukemia data set contains 72 tissue samples with *p*_{init} = 7129 genes: 47 cases of acute lymphoblastic leukemia (ALL), coded 0 and 25 cases of acute myeloid leukemia (AML), coded 1 (Golub *et al*., 1999). The Colon data set contains 62 tissue samples with *p*_{init} = 2000 genes: 40 tumors tissues, coded 1 and 22 normal tissues, coded 0 (Alon *et al*., 1999). The Prostate data set contains 102 tissue samples with *p*_{init} = 12600 genes: 52 tumors tissues, coded 1 and 50 normal tissues, coded 0 (Singh *et al*., 2002).

For Leukemia and Colon data (resp. Prostate), the pre-processing steps of Dudoit *et al*., 2002 [resp. Singh *et al*., 2002] are applied: thresholding [floor of 100 (resp. 10) and ceiling of 16 000] filtering [exclusion of genes with max/min ≤ 5 and (max − min) ≤ 500 (resp. 50)] log_{10}-transformation/standardization. Notice that the filtering step is applied using only the Learning set. This yields a resulting number of covariates *p*_{max} depending on the subdivision Learning and Testing set, lower than *p*_{init} but still far larger than the number of observations.

Although the procedures can handle a large number (thousands) of genes, the number of genes may still be too large for practical use. Furthermore, a considerable percentage of the genes do not show differential expression across groups and only a subset of genes are of interest. We perform the preliminary selection of gene based on the BSS/WSS criterion used in Dudoit *et al*. (2002). When training the rule for the selection of gene, we select *p* genes by the previous criterion with

### Assessing prediction methods

It is common to assess the performance of the classification rules for a selected subset of genes by their errors on the test set and also by their leave-one-out cross-validated errors. Due to the instability of leave-one-out error rates, we also perform a re-randomization study, i.e. an out-of-sample analysis of 100 random subdivisions of the data set into a learning set and a test set. When a test set is available, we randomly split the original data set into a training set and a test set of the same size as the original ones. Otherwise, we choose a test set size equal to one-third of the data [2:1 scheme of Dudoit *et al*. (2002)]. Each subdivision yields a test set error rate for each predictor; boxplots are used to summarize these error rates over the runs.

The optimal number of PLS or PCR components (for RPLS, FPLS or RPCR) is selected by choosing the value of κ minimizing leave-one-out error rates for the training set. This is also employed for other procedures that involve hyperparameters, such as MAVE or KNN. In practice, on the leave-one-out analyses performed on the Colon data sets and on the Prostate data sets, we observed many cases of indecisions for even values of *k*. This is the reason why, as suggested in Devroye *et al*. (1996) we run KNN for odd values of *k*. We really believe that the frequent occurrence of the indecision case shows that KNN is not a pertinent method (for this kind of data sets). The weakness of this classical statistical method is clearly illustrated by the numerical results.

The κ range is given by

_{10}-linearly spaced points in the range [10

^{−2}; 10

^{3}]. Note that, to fairly evaluate and compare the test or leave-one-out cross-validated errors, pre-processing, gene selection and (hyper)-parameter estimations are performed on the training set (at each step of the cross-validation process).

## DISCUSSION

Different numerical results are reported in Tables 1–3 and boxplots are plotted in Figures 1–3. In the tables, the number in brackets for RPLS, RPCR, FPLS and MAVE are the optimal numbers of components chosen as previously indicated and those for KNN are the optimal numbers of nearest neighbors. The numerical results and graphics show the necessity of the dimension-reduction step. This is particularly evident from the Colon and Prostate data results. Indeed note that most of the classifiers proposed in the literature behave well on the Leukemia data set though the other data sets are known to be more ‘problematic’. In particular, the boxplots suggest that errors rates for RPLS, RPCR and FPLS are typically lower and less variable. There is no obvious difference between the distributions of error rates for these three methods. However, we can mention that for Colon and Prostate data FPLS has converged only for small κ values; and that RPCR needs κ values greater than the one of RPLS. Otherwise these methods are robust to the growth of *p*, thanks to the dimension-reduction step (the larger *p* is, the larger κ has to be chosen to reach the best classification result except for FPLS which does not converge for large κ), and to an increasing value of the shrinkage parameter. The good performance of these methods when *p* = *p*_{max} (Table 2) is particularly interesting when applied to microarrays, since it can allow the practitioner to avoid a pre-selection step and thus makes the classification result independent of the criterion applied in this preliminary selection. On the other hand, the methods such as RIRLS, DLDA, DQDA or KNN have very poor performances when *p* gets large. Note that MAVE stands between the two although it is a dimension-reduction method.

Concerning the comparison between RPLS and RIRLS, as mentioned above, we do not trust RIRLS due to the non-scaling of the design matrix that makes the utility of the method problem-specific. It may be read in the tables and figures that RPLS and RIRLS have an equivalent behavior for ‘small’ *p* values. Nevertheless the latter is not robust to large *p*. This legitimately suggests adding to this method a dimension-reduction step; we observed on the three data sets, in the resampling analysis, that this would improve the RIRLS method.

RPLS confirms different analyses in the literature. For example, it is known that in the Leukemia data set, samples #28, 66 and 67 have a high misclassification rate (henceforth denoted MR[*i*] for sample #*i*) (Dudoit *et al*., 2002). In the resampling study, for

*p*both increase: for

*p*= 1000 and κ = 3 (resp.

*p*= 50 and κ = 1),

*MR*[28] = 7.02% and

*MR*[67] = 7.55% (resp. 38.60 and 39.62%). Another example is given by the Colon data set, for which samples

*N*8,34 and

*T*30,33,36 are misclassified by both the contributions (Alon

*et al*., 1999; Furey

*et al*., 2000); in the resampling analysis, performed for

*N*8,34 and

*T*36 are systematically misclassified,

*MR*[

*T*30] ≥ 88.89% and

*MR*[

*T*33] ≥ 96.87%. In addition, RPLS always misclassifies

*N*36 [(an example pointed out in Furey

*et al*., 2000)], and behaves poorly for samples

*T*2,33 [samples pointed out in Alon

*et al*. (1999)]. For the Prostate data set, in the leave-one-out study, the minimal number of misclassified samples is five (and is reached by RPLS), namely samples 32,64,68,84 and 92. Samples 32, 84 and 92 are misclassified for all of the LO analysis

*MR*[32] = 1,

*MR*[64] ≥ 0.41,

*MR*[68] ≥ 0.72,

*MR*[84] = 1 and

*MR*[92] ≥ 0.90.

## CONCLUSIONS

We have proposed a statistical dimension-reduction approach for the classification of tumors based on microarray gene expression data. Our method is designed to address the curse of dimensionality and overcome the problem of a high-dimensional gene expression space so common in such type of problems. We have provided a new extension of partial least squares to binary response data, that seems to have better properties than some of the currently used methods. We restricted our attention to the binary case, but the methodology can be extended to cover multi-class problems and we are interested in doing so. Indeed the structure of the algorithm for the binary case and multi-class case is the same, but the choice of the parameter λ necessitates more attention in the multi-class case than in the binary case. Future research will also concern the variable and model selection themes in order to determine optimal values for (λ, κ).

## APPENDIX: RPLS

For given (**y**, *X*), λ > 0 and κ ≥ 1.

Compute

*Z*← [𝕀_{n}*X*] and Σ as in Equation (4).RIRLS step:

- \[t\leftarrow 0.\]
- \[\begin{array}{c}{\eta }^{\left(t\right)}\leftarrow Z{\gamma }^{\left(t\right)}.\\ {\pi }^{\left(t\right)}\leftarrow [{\left(1+exp(-{\eta }_{k}^{\left(t\right)})\right)}^{-1},1\le k\le n{]}^{\prime }.\\ {W}^{\left(t\right)}\leftarrow \hbox{ diag }\left({\pi }^{\left(t\right)}\right(1-{\pi }^{\left(t\right)}\left)\right).\\ {z}^{\left(t\right)}\leftarrow {\eta }^{\left(t\right)}+{\left({W}^{\left(t\right)}\right)}^{-1}(\underset{-}{\hbox{ y }}-{\pi }^{\left(t\right)}).\\ {\gamma }^{(t+1)}\leftarrow {({Z}^{\prime }{W}^{\left(t\right)}Z+\lambda {\Sigma }^{2})}^{-1}{Z}^{\prime }{W}^{\left(t\right)}{z}^{\left(t\right)}.\\ t\leftarrow t+1.\end{array}\]
- \[\begin{array}{c}{z}^{\infty }\leftarrow {z}^{(t-1)}.\\ {W}^{\infty }\leftarrow {W}^{(t-1)}.\end{array}\]

WPLS step:

- \(\tilde{\Sigma }\leftarrow {\Sigma }_{2:p+1,2p+1,}{X}^{s}\leftarrow X{\tilde{\Sigma }}^{-1}.\).
*t*_{0}← 𝕀_{n},*E*_{0}←*X*^{s},*f*_{0}←*z*^{∞}, ω_{0}← 0_{ℝp}, ψ ← I_{p}.- \[\begin{array}{c}{q}_{k}\leftarrow {t}_{k}^{\prime }{W}^{\infty }{f}_{k}/\left({t}_{k}^{\prime }{W}^{\infty }{t}_{k}\right).\\ {p}_{k}\leftarrow {E}_{k}^{\prime }{W}^{\infty }{t}_{k}/\left({t}_{k}^{\prime }{W}^{\infty }{t}_{k}\right).\\ {f}_{k+1}\leftarrow {f}_{k}-{q}_{k}{t}_{k}.\\ {E}_{k+1}\leftarrow {E}_{k}-{t}_{k}{p}_{k}^{\prime }.\\ \psi \leftarrow \psi ({\hbox{ I }}_{p}-{\omega }_{k}{p}_{k}^{\prime }).\\ {\omega }_{k+1}\leftarrow {{E}^{\prime }}_{k+1}{W}^{\infty }{f}_{k+1}.\\ {\tilde{\psi }}_{k+1}\leftarrow \psi {\omega }_{k+1}.\\ {t}_{k+1}\leftarrow {E}_{k+1}{\omega }_{k+1}.\end{array}\]
Set

\(\tilde{\Psi }\leftarrow [{\tilde{\psi }}_{1}\dots {\tilde{\psi }}_{k}]\)and*q*← [*q*_{1}…*q*_{κ}]′.- \[\begin{array}{c}{\widehat{\gamma }}_{1}\leftarrow {q}_{0}-{p}_{0}^{\prime }\tilde{\Psi }q.\\ {\widehat{\gamma }}_{2:(p+1)}\leftarrow {\tilde{\Sigma }}^{-1}{\tilde{\Psi }}_{q}.\end{array}\]

The procedure, presently derived in ℝ^{p+1}, can be equivalently derived in ℝ^{r+1} where *r* + 1 = rank(*Z*) ≤ *n*. To that goal, compute *UDV*′, the singular values decomposition (svd) of (*X* − 𝕀_{n}

*X*/

*n*)

*r*columns of

*UD*in Ξ = (

*UD*)

_{·,1:r}so that

*Z*γ = [𝕀

_{n}Ξ]θ for some θ ∈ ℝ

^{r+1}. We denote by

*J*

^{(r)}a diagonal matrix with

*k*= 2,…,

*r*+ 1. It is readily seen that RPLS, run by replacing (

*X*, Σ

^{2}) by (Ξ,

*J*

^{(r)}), yields an estimate

*p*which is of computational importance.

^{1}They can be downloaded from http://www.sdmc.lit.org.sg/aGEDatasets/Datasets.html

**Table 1**

p | RIRLS | RPLS | RPCR | FPLS | MAVE | DLDA | DQDA | KNN | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

LO | OS | LO | OS | LO | OS | LO | OS | LO | OS | LO | OS | LO | OS | LO | OS | |

50 | 0 | 1 | 0(3) | 1 | 1(1) | 1 | 0(2) | 1 | 4(3) | 1 | 1 | 1 | 1 | 1 | 1(1) | 2 |

300 | 2 | 3 | 0(1) | 3 | 0(2) | 1 | 0(2) | 0 | 2(1) | 0 | 1 | 2 | 1 | 1 | 1(1) | 1 |

500 | 2 | 3 | 0(1) | 3 | 0(3) | 2 | 0(2) | 0 | 0(1) | 0 | 0 | 2 | 0 | 1 | 0(1) | 1 |

1000 | 2 | 3 | 0(2) | 2 | 0(4) | 2 | 0(2) | 0 | 1(1) | 0 | 0 | 2 | 0 | 2 | 0(1) | 1 |

p | RIRLS | RPLS | RPCR | FPLS | MAVE | DLDA | DQDA | KNN | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

LO | OS | LO | OS | LO | OS | LO | OS | LO | OS | LO | OS | LO | OS | LO | OS | |

50 | 0 | 1 | 0(3) | 1 | 1(1) | 1 | 0(2) | 1 | 4(3) | 1 | 1 | 1 | 1 | 1 | 1(1) | 2 |

300 | 2 | 3 | 0(1) | 3 | 0(2) | 1 | 0(2) | 0 | 2(1) | 0 | 1 | 2 | 1 | 1 | 1(1) | 1 |

500 | 2 | 3 | 0(1) | 3 | 0(3) | 2 | 0(2) | 0 | 0(1) | 0 | 0 | 2 | 0 | 1 | 0(1) | 1 |

1000 | 2 | 3 | 0(2) | 2 | 0(4) | 2 | 0(2) | 0 | 1(1) | 0 | 0 | 2 | 0 | 2 | 0(1) | 1 |

**Table 2**

p | RIRLS | RPLS | RPCR | FPLS | MAVE | DLDA | DQDA | KNN |
---|---|---|---|---|---|---|---|---|

100 | 9 | 9(1) | 7(6) | 8(1)* | 12(1) | 17 | 17 | 7(5) |

500 | 10 | 8(3) | 8(5) | 8(1)* | 7(6) | 18 | 22 | 9(5) |

1000 | 15 | 7(3) | 7(6) | 8(1)* | 15(1) | 20 | 23 | 8(7) |

p_{max} | 17 | 7(3) | 7(6) | 8(1)* | 6(4) | 22 | 25 | 8(7) |

p | RIRLS | RPLS | RPCR | FPLS | MAVE | DLDA | DQDA | KNN |
---|---|---|---|---|---|---|---|---|

100 | 9 | 9(1) | 7(6) | 8(1)* | 12(1) | 17 | 17 | 7(5) |

500 | 10 | 8(3) | 8(5) | 8(1)* | 7(6) | 18 | 22 | 9(5) |

1000 | 15 | 7(3) | 7(6) | 8(1)* | 15(1) | 20 | 23 | 8(7) |

p_{max} | 17 | 7(3) | 7(6) | 8(1)* | 6(4) | 22 | 25 | 8(7) |

Leave-one-out analysis performed on 62 subdivisions of the data set into a learning set (resp. test set) of cardinal 61 (resp. cardinal 1). * means that during the leave-one-out procedure, for a given κ in the range

**Table 3**

p | RIRLS | RPLS | RPCR | FPLS | MAVE | DLDA | DQDA | KNN |
---|---|---|---|---|---|---|---|---|

100 | 9 | 7(3) | 6(8) | 8(2)* | 51(1) | 11 | 11 | 7(3) |

500 | 10 | 8(2) | 9(6) | 8(2)* | 7(2) | 21 | 18 | 8(13) |

1000 | 10 | 5(3) | 5(13) | 8(2)* | 8(4) | 28 | 24 | 10(3) |

1500 | 10 | 7(4) | 5(12) | 10(2)* | 14(4) | 31 | 28 | 12(3) |

p | RIRLS | RPLS | RPCR | FPLS | MAVE | DLDA | DQDA | KNN |
---|---|---|---|---|---|---|---|---|

100 | 9 | 7(3) | 6(8) | 8(2)* | 51(1) | 11 | 11 | 7(3) |

500 | 10 | 8(2) | 9(6) | 8(2)* | 7(2) | 21 | 18 | 8(13) |

1000 | 10 | 5(3) | 5(13) | 8(2)* | 8(4) | 28 | 24 | 10(3) |

1500 | 10 | 7(4) | 5(12) | 10(2)* | 14(4) | 31 | 28 | 12(3) |

Leave-one-out analysis performed on 102 subdivisions of the data set into a learning set (resp. test set) of cardinal 101 (resp. cardinal 1). * has the same meaning as in Table 2.

**Fig. 1**

**Fig. 1**

**Fig. 2**

**Fig. 2**

**Fig. 3**

**Fig. 3**

The authors are really grateful to A. Antoniadis for constructive and fruitful discussions and to the referees for their constructive comments and criticisms which have substantially improved this article. They would like also thank I. De Feis for helpful comments and B. Ding and R. Gentleman for providing a preprint of their paper prior to publication.

Part of this work was supported by the research project ASBGEN and the Interuniversity Attraction Pole (IAP) research network in Statistics P5/24.

## Comments