## Abstract

Motivation: One important aspect of data-mining of microarray data is to discover the molecular variation among cancers. In microarray studies, the number n of samples is relatively small compared to the number p of genes per sample (usually in thousands). It is known that standard statistical methods in classification are efficient (i.e. in the present case, yield successful classifiers) particularly when n is (far) larger than p. This naturally calls for the use of a dimension reduction procedure together with the classification one.

Results: In this paper, the question of classification in such a high-dimensional setting is addressed. We view the classification problem as a regression one with few observations and many predictor variables. We propose a new method combining partial least squares (PLS) and Ridge penalized logistic regression. We review the existing methods based on PLS and/or penalized likelihood techniques, outline their interest in some cases and theoretically explain their sometimes poor behavior. Our procedure is compared with these other classifiers. The predictive performance of the resulting classification rule is illustrated on three data sets: Leukemia, Colon and Prostate.

Availability: Software that implements the procedures and data source on which this paper focuses are freely available at http://www-lmc.imag.fr/SMS/membres/Gersende_Fort,Sophie_Lambert.html

Contact:sophie.lambert@imag.fr

## INTRODUCTION

Microarray technology generates a vast amount of data by measuring, through the hybridization process, the levels of virtually all the genes expressed in a biological sample. One can expect that knowledge gleaned from microarray data will contribute significantly to advances in fundamental questions in biology as well as in clinical medicine. One important goal of analyzing microarray data is to classify the samples. To cite a few, Golub et al. (1999) have considered classification of acute leukemia and Alon et al. (1999) have addressed the cluster analysis of tumor and normal colon tissues. The approaches developed in these papers consist in discrimination methods and machine learning methods [see Dudoit et al. (2002) for a comparative study].

In microarray studies, the number of samples, n, is relatively small compared to the number of genes, p, usually in thousands. Unless a preliminary variable selection step is performed, standard statistical methods in classification perform poorly because there are far more variables than observations. One problem is multicollinearity: estimating equations become singular and have no unique and stable solution. For instance, the pooled within-class sample covariance matrix in Fisher's linear discriminant function is singular if n < p + 2. Even if all genes can be used as in support vector machines, it seems to be not sensible to use all the genes. Indeed, this use allows the presence of the noise associated with genes of little or no discrimination power. That inhibits and degrades the performances of the classification rules in their application to unclassified tumor. In this situation, dimension reduction is needed to reduce the high p-dimensional gene space. In most previously mentioned works, the authors have used univariate methods for reducing the number of genes. Alternative approaches to handle the dimension reduction problem can also be used (see for instance Ghosh, 2002; Nguyen and Rocke, 2002; West et al., 2001; Antoniadis et al., 2003).

Similar data structures have been encountered in the field of chemometrics. The method of partial least squares (PLS) (Wold, 1975; Naes and Martens, 1985; Helland, 1988) has been found to be a useful dimension reduction technique as well as principal component regression (PCR) [Massy, 1965; see Frank and Friedman (1993) for a statistical view of PLS and PCR]. In the context of microarrays, the purpose of PCR is to produce orthogonal tumor descriptors that reduce the dimension to only a few gene components (super-genes) (West et al., 2001). But the dimension reduction is achieved without regard to the response variable and may be inefficient. This is the reason why PLS looks more adapted than PCR for dimension reduction based prediction. Indeed, PLS components are chosen so that the sample covariance between the response and a linear combination of the p predictors (genes) is maximum.

Nguyen and Rocke (2002) proposed using PLS for dimension reduction as a preliminary step to classification, based either on linear logistic discrimination, or linear or quadratic discriminant analysis. However, this seems to be intuitively unappealing because PLS is really designed to handle continuous responses and models that do not suffer from heteroscedasticity as it is the case for Bernoulli or multinomial data. Furthermore, in practice we have observed problems in the convergence of the iteratively reweighted least squares (IRLS) algorithm, which is the usual procedure for solving the maximum likelihood (ML) equation in the field of the generalized linear models (GLM). Indeed, for logistic regression, it is well known that convergence poses a long standing problem. Infinite parameter estimates can occur depending on the configuration of the sample points in the observation space (Albert and Anderson, 1984).

Marx (1996) proposed an extension of PLS to a categorical response variable and illustrated the developments from a spectroscopy example. His approach embedded the usual PLS steps within the IRLS. Unfortunately, we have observed that this algorithm does not converge in many cases of interest (such as in the applications considered in this paper). More recently, Ding and Gentleman (2004, http://www.bepress.com/bioconductor/) proposed an approach based on this procedure. They phrased the problem in a GLM setting and applied Firth's procedure to avoid (quasi)separation.

To deal with the high-dimension problem, another approach consists in penalizing the likelihood. Eilers et al. (2001) propose useing the Ridge penalized logistic regression in order to both stabilize the statistical problem and remove numerical degeneracy due to multicollinearity. They have shown that this method appears to work well with microarray data. Note that this method is not a dimension-reduction technique. Indeed all explanatory variables are allowed into the regression model. From the log-likelihood a so-called Ridge penalty is subtracted. All the genes contribute, which can inhibit and degrade the performances of the classification rules. Note that we can find alternative approaches (see, for example, Huang and Pan, 2003; Ghosh, 2003) for which the classification problem is not viewed as a problem in a logistic regression.

In this paper, we extend the PLS method to a binary response variable. To do that, we want to substitute the categorical response variable in the input of PLS by a continuous-valued pseudo-response variable whose expected value has a linear relationship with the covariates. The limiting pseudo-response variable in the IRLS algorithm seems to be a good candidate. Unfortunately, in the present situation, ‘small n, large p’, IRLS no longer works since the limiting pseudo-response variable is, in norm, infinite. The idea developed here is to penalize with a Ridge penalty the likelihood criterion in order to constrain the pseudo-response variable to be finite. That is, our procedure combines a Ridge penalty step and a PLS step and the dimension-reduction step is incorporated in the classification step. Here we present the classification rule for the binary response variable indicating a normal or colon tumor, for instance. Nevertheless, our approach remains valid for multi-categorical response variables. But the binary case is the simplest case which allows us to point out whether such a procedure works well or not and why.

This paper is organized as follows. The Methods section is the methodological part of this paper. It contains a description of the logistic regression and linear discrimination. We then recall the Ridge regression method and derive a weighted PLS algorithm in order to address the dimension reduction in heteroscedastic models. We then introduce an extension of PLS to GLM based on the Ridge penalty, and analyze the Nguyen and Rocke, Marx, Ding and Gentleman and Eilers et al. algorithms. Applications to disease classification through microarray are presented in the Results section.

## METHODS

### Some basic ingredients

After introducing some notations, we recall the principle of linear logistic discrimination, some results on the existence of the maximum likelihood estimator and the classical algorithm used to compute it. Next, we present a regularization method, a penalized maximum likelihood method, and a dimension-reduction technique, PLS.

### Notations

Expression levels of the p genes for the n microarray samples are collected in an n × p data matrix X = (xi/j), 1 ≤ in, 1 ≤ jp. The entry xi/j is the expression level of the variable ‘gene’ j in the microarray sample i, and the i-th row Xi is the vector of a gene expression profile for sample i. More generally, for a matrix A, Ai,j denotes the entry (i, j), A·,j (resp. Ai) denotes the column vector collecting the column #j (resp. the row #i). Ai1:i2,j1:j2 is the (i2i1 + 1) × (j2j1 + 1) matrix formed by picking out the rows i1 to i2 and columns j1 to j2 of A; A·,j1:j2 is formed by picking out the columns j1 to j2 of A. The labels of the n microarray samples are collected in a {0,…, (g − 1)}n-valued vector y. In supervised machine learning, each sample is thought to originate from a specific class k ∈ {0,…,g − 1} where the number of possible classes g is known and fixed. A classifier can be regarded as a function G: ℝp → {0,…,g−1} that predicts the unknown class label of a new tissue sample x ∈ℝp by G(x). We assume that the data (y, X) collect observations of n statistically independent and identically distributed random pairs (Y, X). We choose a logit model for the data (see, e.g. Fahrmeir and Tutz, 2001), and the logistic discrimination (LD) method for the classification procedure (see, e.g. Timm, 2002). In the terminology of the regression analysis, (X·,j)1≤jp are the predictor variables and (yi)1≤in the response variables. We include an intercept into the regression model, and denote by Z = [𝕀nX] the design matrix of size n × (p + 1), where 𝕀n = (1,…,1)′ stands for the column vector of length n (′ denotes the transposition operator).

### Linear LD

In logit models, the conditional class probability—or equivalently, the conditional expectation of Y given X—P(Y = 1|X = x;γ) is related to x and some parameter γ ∈ ℝp+1 through the relation P(Y = 1| X = x;γ) = h([1 x′]γ) where h(η) = 1/[1 + exp(−η)]. The quantity [1 x′] γ is called the linear predictor. γ is an unknown parameter that has to be estimated from the data. In LD, it is usually estimated by

$${\widehat{\gamma }}^{\hbox{ ML }}$$
, the ML estimator. The log-likelihood of the observations for the value γ of the parameter, simply denoted by l(γ), is given by
(1)
$l\left(\gamma \right)={\displaystyle \sum _{i=1}^{n}}\left\{{\underset{-}{\hbox{ y }}}_{i}{\eta }_{i}\left(\gamma \right)-ln[1+exp({\eta }_{i}\left(\gamma \right)\left)\right]\right\},$
where for all 1 ≤ in, ηi(γ) = (Zγ)i.

For a vector z = [1 x′], the predicted class

$$\widehat{y}$$
of each sample is 1 if
$$\widehat{\pi } > 1-\widehat{\pi }$$
and 0 otherwise, where
$$\widehat{\pi }=h\left({z}^{\prime }{\widehat{\gamma }}^{\hbox{ ML }}\right)$$
. Nevertheless, as discussed below, in some cases, including in practice the case np, the existence and unicity of
$${\widehat{\gamma }}^{\hbox{ ML }}$$
for logit models is not guaranteed.

### Maximum likelihood estimate and IRLS

We say that the ML estimate exists if there exists γ ∈ ℝp+1 of finite norm which is a maximizer of the concave log-likelihood l. Hence, such an estimate is a solution to the normal equation Z′(y − π(γ)) = 0, where π(γ) is the ℝn-valued mean vector with coordinates πi(γ) = hi(γ)).

If Z is full column-rank, the solution, when exists, is unique. The existence of a solution, when Z is full column-rank, depends on the configuration of the n sample points in the observation space ℝp (Albert and Anderson, 1984; Santner and Duffy, 1986). There are three exclusive situations: separate, quasi-separate and overlap. In the first two cases, there exists

$$\widehat{\gamma }$$
such that
$${\left(Z\widehat{\gamma }\right)}_{i}\ge 0$$
for all i such that yi = 1 and
$${\left(Z\widehat{\gamma }\right)}_{i}\le 0$$
for all i such that yi = 0; roughly speaking, this means that there exists a hyperplane that exactly separates the two classes, except maybe some points that can belong to the hyperplane. In such a case, l reaches its maximum as ‖γ‖ tends to +∞ and the ML estimate does not exist. In the third case, the estimate exists and is computed as the limit of a converging Newton–Raphson sequence; this algorithm is known as the IRLS algorithm (Green, 1984). Let W(γ) be the diagonal n × n matrix with diagonal entries Wi,i(γ) = πi(γ)[1 − πi(γ)]. Each iteration divides into two steps,
(2)
${z}^{\left(t\right)}=Z{\gamma }^{\left(t\right)}+{\left[{W}^{\left(t\right)}\right]}^{-1}\left(\underset{-}{\hbox{ y }}-{\pi }^{\left(t\right)}\right),$

(3)
${\gamma }^{(t+1)}={\left({Z}^{\prime }{W}^{\left(t\right)}Z\right)}^{-1}{Z}^{\prime }{W}^{\left(t\right)}{z}^{\left(t\right)},$
where W(t) and π(t) are shorthand notations for W(t)) and π(γ(t)). IRLS can thus be considered as an iterative weighted least square regression of an ℝn-valued pseudo-variable z(t) onto the columns of Z. We denote this algorithm by IRLS (y, x)

When Z is not full column-rank, the parameter is not identifiable and the ML estimate is not unique when exists; applying the above iterations [Equations (2) and (3)] by replacing the inverse matrix (3) with the Moore–Penrose pseudo-inverse, yields the parameter estimate which is of minimal norm among all the solutions. In practice, in the present statistical framework np, n = rank(Z) and the minimal norm solution verifies for all 1 ≤ in, (Zγ)i = ln(yi) − ln(1 − yi); it is thus of infinite norm and the ML estimate cannot exist. This calls for regularization methods.

### Ridge penalty and RIRLS

The Ridge estimator (Le Cessie and Van Houwelingen, 1992

$${\widehat{\gamma }}^{R}$$
is defined as the (unique) maximizer of the penalized likelihood l*(γ) = l(γ) − 0.5λγ′Σ2γ, where λ > 0 is the shrinkage parameter, and Σ2 is a diagonal matrix with entries
$${\Sigma }_{1,1}^{2}=0$$
and
(4)
$\begin{array}{cc}{\Sigma }_{j,j}^{2}={\displaystyle \sum _{i=1}^{n}}{({Z}_{i,j}-{\mathbb{I}}_{\prime }^{n}Z.{,}_{j}/n)}^{2},& j\in \{2,\dots p+1\}.\end{array}$
The weighted penalty term takes into account the non-scaling of the covariate matrix X, and does not apply to the location parameter γ1.
$${\widehat{\gamma }}^{R}$$
always exists, is unique and is computed as the limit of a Newton–Raphson sequence. We denote by RIRLS(y, X, λ) (shorthand notation for Ridge-IRLS) this algorithm. It consists in replacing in IRLS, the weighted regression (3) by a weighted Ridge regression γ(t+1) = (ZW(t)Z + λΣ2)−1ZW(t)z(t), where z(t) is built as in Equation (2).

λ controls the amount of shrinkage in the data and can be chosen as the minimum, over a given range, of the BIC criterion −2

$$-2l\left({\widehat{\gamma }}^{R}\right)+log\left(n\right)\hbox{ trace }{\left[Z\right({Z}^{\prime }W{\widehat{\gamma }}^{R})Z+\lambda {\Sigma }^{2})}^{-1}\times {Z}^{\prime }W\left({\widehat{\gamma }}^{R}\right)]$$
(Kass and Raftery, 1995).

### Weighted PLS

PLS is both a tool for linear regression and a tool for dimension reduction (Wold, 1975Naes and Martens, 1985Helland, 1988). Let y ∈ ℝn be a response vector, X be an n × p data matrix and W be a symmetric positive definite n × n matrix. PLS (i) defines κ W-orthogonal scores (tk)1≤k≤κ, linear combinations of the columns of Z such that for all k,

$${\mathbb{I}}_{n}^{\prime }W{t}_{k}=0$$
and (ii) performs a W-weighted least squares regression of y on (𝕀n, t1,…,tκ). This yields the decomposition
$\underset{-}{\hbox{ y }}={q}_{0}{\mathbb{I}}_{n}+{q}_{1}{t}_{1}+\dots {q}_{\kappa }{t}_{\kappa }+{f}_{\kappa +1}=Z{\widehat{y}}^{\hbox{ PLS },\kappa }+{f}_{\kappa +1}$
where the residual term fκ+1 is W-orthogonal to the vectors (𝕀n, t1,…,tκ). Contrary to classical dimension-reduction methods (such as PCR), the scores depend on the response vector y; roughly speaking, given (tk)1≤kl, tl+1 is the linear combination of the columns of Z, i.e. is of the form tl+1 = Zc, which is the most informative on the residual response variable fl+1, when information is defined in terms of the weighted covariance |Cov(
$$\sqrt{W}$$
Zc,
$$\sqrt{W}$$
fl+1)|(
$$\sqrt{W}$$
denotes the square root matrix of W) (Helland, 1988). While the maximal number of PLS scores κmax can be lower than rank(X), in practice, it is often equal to rank(X). Helland (1988) shows that the weighted PLS (WPLS) regression applied with κ = κmax is nothing more than the weighted least squares regression. In the literature, PLS is usually derived with W = I, the identity matrix; we thus detail the algorithm in the weighted case. Let
$$\tilde{\Sigma }$$
be the p × p positive-definite diagonal matrix with diagonal entries Σj,j, j ≥ 2, given by Equation (4).Hereafter, this procedure is denoted by WPLS (y, X, W, κ). If Z is full column-rank, this algorithm determines a unique estimate
$${\widehat{\gamma }}^{\hbox{ PLS },\kappa }$$
satisfying
$${\underset{-}{\hbox{ y }}-f}_{k+1}=Z{\widehat{\gamma }}^{\hbox{ PLS },\kappa }$$
; if Z is not full column-rank, the procedure above yields the minimal norm vector among all the vectors verifying yfk+1 = Zγ.

1. $${X}^{s}=X{\tilde{\Sigma }}^{-1}$$
, t0 = 𝕀n, E0 = Xs; f0 = y.

2. For k = 0, …, κ,

$\begin{array}{c}{q}_{k}={t}_{k}^{\prime }W{f}_{k}/\left({t}_{k}^{\prime }W{t}_{k}\right),\hbox{ \hspace{1em} }{f}_{k+1}={f}_{k}-{q}_{k}{t}_{k},\\ {E}_{k+1}={E}_{k}-{t}_{k}{t}_{k}^{\prime }W{E}_{k}/\left({t}_{k}^{\prime }W{t}_{k}\right),\\ {t}_{k+1}={E}_{k+1}{E}_{k+1}^{\prime }W{f}_{k+1}.\end{array}$

### Ridge PLS

A direct application of PLS to GLM seems to be intuitively unappealing because PLS handles continuous responses. This is the reason why, in order to extend PLS to GLM, we want to replace the binary vector y with a pseudo-response variable whose expected value has a linear relationship with the covariates. The pseudo-response variable z at the convergence of RIRLS(y, X, λ) verifies this condition and is thus our candidate: it is of the form

$${z}^{\infty }=Z{\widehat{\gamma }}^{\hbox{ R }}+\epsilon$$
, where, conditional to
$${\widehat{\gamma }}^{\hbox{ R }}$$
being the true value of the parameter, ε is a centered vector of covariance matrix (W)−1. The main advantage of choosing z instead of, for example, the pseudo-variable at the convergence of IRLS—which has a linear structure too—is that this allows the combination of a regularization step and of a dimension-reduction step. In addition, this extension is always well defined: recall indeed that in some cases (including the case np), the ML estimate does not exist so that the pseudo-variable ‘at convergence’ of IRLS is of infinite norm.

As a consequence, we propose a new procedure which combines Ridge penalty (the regularization step) and PLS (the dimension-reduction step) the so-called Ridge PLS (RPLS). Let λ be some positive real constant and κ be some positive integer. RPLS divides in two steps:A detailed implementation is given in the Appendix. The first step builds a continuous response variable z for the input of PLS, the ‘dispersion matrix’ of which is [W]−1. This explains the call, in the second step, to a weighted PLS procedure with weight W. The use of Xs in WPLS and of Σ in the penalized ridge criterion makes our procedure invariant to the scaling of the data matrix.

1. (z, W) ← RIRLS(y, X, λ);

2. $${\widehat{\gamma }}^{\hbox{ PLS },\kappa }\leftarrow \hbox{ WPLS }({z}^{\infty },X,{W}^{\infty },\kappa ).$$
.

RPLS depends on two parameters, λ and κ. λ is determined at the end of Step 1, as minimizing the BIC criterion (see the Ridge penalty section), and thus independently of κ. In the linear regression setting, the optimal choice of κ when dimension reduction is achieved by PLS, is to our best knowledge, an open problem: the non-linear dependence of

$${\widehat{\gamma }}^{\hbox{ PLS },\kappa }$$
upon the response vector makes an explicit control of the error term fκ+1 impossible. Finally, observe that RPLS provides an estimate
$${\widehat{\gamma }}^{\hbox{ RPLS }}$$
(which is unique, given y, X, λ and κ).

We are now able to provide an answer to the classification problem in a high-dimensional setting: our classification procedure consists in applying LD with the estimate

$${\widehat{\gamma }}^{\hbox{ RPLS }}$$
.

### Comparison with other approaches

We briefly review some regression procedures that use PLS as the dimension-reduction tool to manage the high-dimensional setting. We outline their interest and in some cases, explain their poor behavior.

#### Nguyen and Rocke's approach

Nguyen and Rocke (2002) substitute the data matrix X by an n × κ matrix

$$\tilde{X}$$
, the columns of which are the first κ PLS-scores given by WPLS (y, X, I, f). Then they estimate the parameter in the ML sense by running IRLS (y,
$$\tilde{X}$$
). This yields
$${\widehat{\gamma }}^{\hbox{ NR }}$$
. As mentioned above, applying PLS with a binary input y is unappealing; in addition, the PLS-regression step does not take into account the heteroscedasticity of the response vector y; finally, in many applications,
$$\Vert {\widehat{\gamma }}^{\hbox{ NR }}\Vert =\infty$$
since the ML estimate does not exist.

In practice, IRLS is stopped after a maximal number of iterations nmax thus hiding the non-convergence of IRLS. Unfortunately, the estimate

$${\widehat{\gamma }}^{\hbox{ NR }}$$
depends on nmax and this yields an unstable procedure for classification. We observed this phenomenon on the Leukemia data set.
$${\widehat{\gamma }}^{\hbox{ NR }}$$
is estimated by using the data in the Golub's training set (Golub et al., 1999); classification is performed on the samples from the test set. When p = 150 and κ = 3, there are 1 (resp. 2) samples incorrectly classified if nmax = 7 (resp. nmax = 10).

#### Marx's approach

In Marx (1996) the parameter γ is estimated in the ML sense and is obtained at the convergence of IRLS(y,

$$\tilde{X}$$
), where
$$\tilde{X}$$
is defined by IRPLS, an algorithm that extends PLS to GLM. More precisely, IRPLS can be understood as an IRLS algorithm in which the weighted least squares regression (3) is replaced with the WPLS regression, WPLS(z(t), X, W(t), rank(E1)).
$$\tilde{X}$$
collects the first κ components ‘at convergence’ of IRPLS.

As recalled above, WPLS applied with the maximal number of PLS components is nothing else than weighted least squares (note that Marx chooses κ = rank(E1) while in theory, κmax should be strictly lower than rank(E1)). Hence IRPLS and IRLS coincide, and, when X is full row-rank (which is most often the case when np), IRPLS never converges. In practice, IRPLS is stopped after a fixed number of iterations, thus hiding the non-convergence phenomenon. In addition, initializing IRPLS by choosing a linear predictor of the form η(0) = c0yc0(𝕀ny) [where for example c0 = ln(3)], as done by Marx, yields

$${\widehat{\gamma }}^{\hbox{ M }}={\widehat{\gamma }}^{\hbox{ NR }}$$
. A trivial induction shows that for all t ≥ 0, z(t) = 2ctyct 𝕀n with ct = 1 + ct−1 + exp(−ct−1), and W(t) is proportional to the identity matrix In. Since WPLS(y, X, W, κ) = WPLS(αy + β 𝕀n, X, W, κ)—in terms of the exhibited scores—for all α, β ∈ ℝ, WPLS (z(t), X, W(t), κ) returns the same scores as WPLS (y, X, In, κ), thus proving
$${\widehat{\gamma }}^{\hbox{ M }}={\widehat{\gamma }}^{\hbox{ NR }}$$
.

#### Ding and Gentleman's approach

The originality of their work (Ding and Gentleman, 2004) is that it simultaneously answers to the regularization question and to the dimension-reduction one. They run an approximation of a Newton–Raphson (NR) algorithm for solving a Firth's penalized ML criterion. As in IRLS, any iteration of the NR algorithm is a weighted least squares regression and Ding and Gentleman replace this least square regression by a WPLS one. We call this algorithm FPLS.

We run their method on the data sets described in the next section. On the colon data set and on the prostate data set, the algorithm does not always converge; we observe a cyclic behavior—after a burn-in period the path is periodic. The estimate

$${\widehat{\gamma }}^{\hbox{ DG }}$$
, and consequently the classification rule, may depend on the maximal number of iterations.

This approach is greatly promising since it addresses both the regularization and the dimension-reduction problems. Comparisons of our results with their approach are of interest and will be explored in future research.

#### Eilers et al.'s approach

Their method (Eilers et al., 2001) does not use PLS. We nevertheless mention their work since their estimate,

$${\widehat{\gamma }}^{\hbox{ DG }}$$
is the Ridge-penalized ML estimate (with an un-weighted penalty term i.e. Σ2 = I). The method of Eilers et al. does not reduce the dimension but only deals with the regularization question. In particular, all the explanatory variables are allowed and included into the regression model, which can deteriorate the performances of the classifier. In the next section, we will outline the high interest of combining a reduction step with the Ridge regularization.

## RESULTS

We illustrate the interest of RPLS by considering applications for the classification of microarrays data. We compare the classification results from our procedure with those of other classifiers including RIRLS, FPLS, the effective dimension reduction (MAVE, Antoniadis et al., 2003), diagonal linear discriminant analysis (DLDA), diagonal quadratic discriminant analysis (DQDA) and k-nearest neighbors (KNN) based on the Euclidean distance [see Devroye et al. (1996) for an overview of the last three methods].

DLDA, DQDA and KNN are thus introduced in the present paper as ‘classical statistical methods’. As commented in the abstract, our goal is to show that these methods poorly behave when applied to high-dimensional data sets. This is exactly what happens, thus stressing the need for interest in methods based on regularization and dimension reduction.

In order to illustrate the interest of PLS over PCR in the regression framework, we compare our algorithm RPLS to ‘RPCR’ (for Ridge-PCR). By nature, PCR handles continuous responses; this calls for an extension of PCR to GLM, in order to use it as a dimension reduction in GLM. The extension we derived for PLS remains valid for PCR: we exhibit the continuous-valued pseudo-response variable at the convergence of the RIRLS algorithm and use this variable as the input variable for PCR. This yields RPCR.

### Data, pre-processing and gene selection

We will consider in turn the Leukemia, Colon and Prostate data sets.1 The Leukemia data set contains 72 tissue samples with pinit = 7129 genes: 47 cases of acute lymphoblastic leukemia (ALL), coded 0 and 25 cases of acute myeloid leukemia (AML), coded 1 (Golub et al., 1999). The Colon data set contains 62 tissue samples with pinit = 2000 genes: 40 tumors tissues, coded 1 and 22 normal tissues, coded 0 (Alon et al., 1999). The Prostate data set contains 102 tissue samples with pinit = 12600 genes: 52 tumors tissues, coded 1 and 50 normal tissues, coded 0 (Singh et al., 2002).

For Leukemia and Colon data (resp. Prostate), the pre-processing steps of Dudoit et al., 2002 [resp. Singh et al., 2002] are applied: thresholding [floor of 100 (resp. 10) and ceiling of 16 000] filtering [exclusion of genes with max/min ≤ 5 and (max − min) ≤ 500 (resp. 50)] log10-transformation/standardization. Notice that the filtering step is applied using only the Learning set. This yields a resulting number of covariates pmax depending on the subdivision Learning and Testing set, lower than pinit but still far larger than the number of observations.

Although the procedures can handle a large number (thousands) of genes, the number of genes may still be too large for practical use. Furthermore, a considerable percentage of the genes do not show differential expression across groups and only a subset of genes are of interest. We perform the preliminary selection of gene based on the BSS/WSS criterion used in Dudoit et al. (2002). When training the rule for the selection of gene, we select p genes by the previous criterion with

$$p\in {\mathcal{P}}_{l}=\{50,300,500,1000\}$$
for Leukemia data,
$$p\in {\mathcal{P}}_{c}=\{100,500,1000,{p}_{max}\}$$
for Colon data and
$$p\in {\mathcal{P}}_{p}=\{100,500,1000,1500\}$$
for Prostate data.

### Assessing prediction methods

It is common to assess the performance of the classification rules for a selected subset of genes by their errors on the test set and also by their leave-one-out cross-validated errors. Due to the instability of leave-one-out error rates, we also perform a re-randomization study, i.e. an out-of-sample analysis of 100 random subdivisions of the data set into a learning set and a test set. When a test set is available, we randomly split the original data set into a training set and a test set of the same size as the original ones. Otherwise, we choose a test set size equal to one-third of the data [2:1 scheme of Dudoit et al. (2002)]. Each subdivision yields a test set error rate for each predictor; boxplots are used to summarize these error rates over the runs.

The optimal number of PLS or PCR components (for RPLS, FPLS or RPCR) is selected by choosing the value of κ minimizing leave-one-out error rates for the training set. This is also employed for other procedures that involve hyperparameters, such as MAVE or KNN. In practice, on the leave-one-out analyses performed on the Colon data sets and on the Prostate data sets, we observed many cases of indecisions for even values of k. This is the reason why, as suggested in Devroye et al. (1996) we run KNN for odd values of k. We really believe that the frequent occurrence of the indecision case shows that KNN is not a pertinent method (for this kind of data sets). The weakness of this classical statistical method is clearly illustrated by the numerical results.

The κ range is given by

$${\mathcal{K}}_{\hbox{ l }}=\{1,\dots 8\}$$
for Leukemia data,
$${\mathcal{K}}_{\hbox{ c }}=\{1,\dots 9\}$$
for Colon data and
$${\mathcal{K}}_{\hbox{ p }}=\{1,\dots 14\}$$
for Prostate data. Moreover, the shrinkage parameter (for RIRLS, RPLS or RPCR) is determined as mentioned above on 51 log10-linearly spaced points in the range [10−2; 103]. Note that, to fairly evaluate and compare the test or leave-one-out cross-validated errors, pre-processing, gene selection and (hyper)-parameter estimations are performed on the training set (at each step of the cross-validation process).

## DISCUSSION

Different numerical results are reported in Tables 13 and boxplots are plotted in Figures 13. In the tables, the number in brackets for RPLS, RPCR, FPLS and MAVE are the optimal numbers of components chosen as previously indicated and those for KNN are the optimal numbers of nearest neighbors. The numerical results and graphics show the necessity of the dimension-reduction step. This is particularly evident from the Colon and Prostate data results. Indeed note that most of the classifiers proposed in the literature behave well on the Leukemia data set though the other data sets are known to be more ‘problematic’. In particular, the boxplots suggest that errors rates for RPLS, RPCR and FPLS are typically lower and less variable. There is no obvious difference between the distributions of error rates for these three methods. However, we can mention that for Colon and Prostate data FPLS has converged only for small κ values; and that RPCR needs κ values greater than the one of RPLS. Otherwise these methods are robust to the growth of p, thanks to the dimension-reduction step (the larger p is, the larger κ has to be chosen to reach the best classification result except for FPLS which does not converge for large κ), and to an increasing value of the shrinkage parameter. The good performance of these methods when p = pmax (Table 2) is particularly interesting when applied to microarrays, since it can allow the practitioner to avoid a pre-selection step and thus makes the classification result independent of the criterion applied in this preliminary selection. On the other hand, the methods such as RIRLS, DLDA, DQDA or KNN have very poor performances when p gets large. Note that MAVE stands between the two although it is a dimension-reduction method.

Concerning the comparison between RPLS and RIRLS, as mentioned above, we do not trust RIRLS due to the non-scaling of the design matrix that makes the utility of the method problem-specific. It may be read in the tables and figures that RPLS and RIRLS have an equivalent behavior for ‘small’ p values. Nevertheless the latter is not robust to large p. This legitimately suggests adding to this method a dimension-reduction step; we observed on the three data sets, in the resampling analysis, that this would improve the RIRLS method.

RPLS confirms different analyses in the literature. For example, it is known that in the Leukemia data set, samples #28, 66 and 67 have a high misclassification rate (henceforth denoted MR[i] for sample #i) (Dudoit et al., 2002). In the resampling study, for

$$\kappa \in {\mathcal{K}}_{1}$$
and
$$p\in {\mathcal{P}}_{1}$$
, RPLS systematically misclassifies sample #66, whereas MR[28] and MR[67] decrease when κ and p both increase: for p = 1000 and κ = 3 (resp. p = 50 and κ = 1), MR[28] = 7.02% and MR[67] = 7.55% (resp. 38.60 and 39.62%). Another example is given by the Colon data set, for which samples N8,34 and T30,33,36 are misclassified by both the contributions (Alon et al., 1999; Furey et al., 2000); in the resampling analysis, performed for
$$\kappa \in {\mathcal{K}}_{c}$$
,
$$p\in {\mathcal{P}}_{c}$$
, N8,34 and T 36 are systematically misclassified, MR[T30] ≥ 88.89% and MR[T33] ≥ 96.87%. In addition, RPLS always misclassifies N36 [(an example pointed out in Furey et al., 2000)], and behaves poorly for samples T2,33 [samples pointed out in Alon et al. (1999)]. For the Prostate data set, in the leave-one-out study, the minimal number of misclassified samples is five (and is reached by RPLS), namely samples 32,64,68,84 and 92. Samples 32, 84 and 92 are misclassified for all of the LO analysis
$$(p\in {\mathcal{P}}_{p},\kappa \in {\mathcal{K}}_{p})$$
; samples 64 and 68 are misclassified in more than 96 and 91% of the LO analysis. In the resampling study, MR[32] = 1, MR[64] ≥ 0.41, MR[68] ≥ 0.72, MR[84] = 1 and MR[92] ≥ 0.90.

## CONCLUSIONS

We have proposed a statistical dimension-reduction approach for the classification of tumors based on microarray gene expression data. Our method is designed to address the curse of dimensionality and overcome the problem of a high-dimensional gene expression space so common in such type of problems. We have provided a new extension of partial least squares to binary response data, that seems to have better properties than some of the currently used methods. We restricted our attention to the binary case, but the methodology can be extended to cover multi-class problems and we are interested in doing so. Indeed the structure of the algorithm for the binary case and multi-class case is the same, but the choice of the parameter λ necessitates more attention in the multi-class case than in the binary case. Future research will also concern the variable and model selection themes in order to determine optimal values for (λ, κ).

## APPENDIX: RPLS

For given (y, X), λ > 0 and κ ≥ 1.

1. Compute Z ← [𝕀nX] and Σ as in Equation (4).

2. RIRLS step:

• Initialize γ(0) ∈ ℝp+1.

$t\leftarrow 0.$

• Until convergence, do

$\begin{array}{c}{\eta }^{\left(t\right)}\leftarrow Z{\gamma }^{\left(t\right)}.\\ {\pi }^{\left(t\right)}\leftarrow [{\left(1+exp(-{\eta }_{k}^{\left(t\right)})\right)}^{-1},1\le k\le n{]}^{\prime }.\\ {W}^{\left(t\right)}\leftarrow \hbox{ diag }\left({\pi }^{\left(t\right)}\right(1-{\pi }^{\left(t\right)}\left)\right).\\ {z}^{\left(t\right)}\leftarrow {\eta }^{\left(t\right)}+{\left({W}^{\left(t\right)}\right)}^{-1}(\underset{-}{\hbox{ y }}-{\pi }^{\left(t\right)}).\\ {\gamma }^{(t+1)}\leftarrow {({Z}^{\prime }{W}^{\left(t\right)}Z+\lambda {\Sigma }^{2})}^{-1}{Z}^{\prime }{W}^{\left(t\right)}{z}^{\left(t\right)}.\\ t\leftarrow t+1.\end{array}$

• Set

$\begin{array}{c}{z}^{\infty }\leftarrow {z}^{(t-1)}.\\ {W}^{\infty }\leftarrow {W}^{(t-1)}.\end{array}$

3. WPLS step:

• $$\tilde{\Sigma }\leftarrow {\Sigma }_{2:p+1,2p+1,}{X}^{s}\leftarrow X{\tilde{\Sigma }}^{-1}.$$
.

• t0 ← 𝕀n, E0Xs, f0z, ω0 ← 0p, ψ ← Ip.

• For k = 0, …, κ,

$\begin{array}{c}{q}_{k}\leftarrow {t}_{k}^{\prime }{W}^{\infty }{f}_{k}/\left({t}_{k}^{\prime }{W}^{\infty }{t}_{k}\right).\\ {p}_{k}\leftarrow {E}_{k}^{\prime }{W}^{\infty }{t}_{k}/\left({t}_{k}^{\prime }{W}^{\infty }{t}_{k}\right).\\ {f}_{k+1}\leftarrow {f}_{k}-{q}_{k}{t}_{k}.\\ {E}_{k+1}\leftarrow {E}_{k}-{t}_{k}{p}_{k}^{\prime }.\\ \psi \leftarrow \psi ({\hbox{ I }}_{p}-{\omega }_{k}{p}_{k}^{\prime }).\\ {\omega }_{k+1}\leftarrow {{E}^{\prime }}_{k+1}{W}^{\infty }{f}_{k+1}.\\ {\tilde{\psi }}_{k+1}\leftarrow \psi {\omega }_{k+1}.\\ {t}_{k+1}\leftarrow {E}_{k+1}{\omega }_{k+1}.\end{array}$

• Set

$$\tilde{\Psi }\leftarrow [{\tilde{\psi }}_{1}\dots {\tilde{\psi }}_{k}]$$
and q ← [q1qκ]′.

• Conclude:

$\begin{array}{c}{\widehat{\gamma }}_{1}\leftarrow {q}_{0}-{p}_{0}^{\prime }\tilde{\Psi }q.\\ {\widehat{\gamma }}_{2:(p+1)}\leftarrow {\tilde{\Sigma }}^{-1}{\tilde{\Psi }}_{q}.\end{array}$

The procedure, presently derived in ℝp+1, can be equivalently derived in ℝr+1 where r + 1 = rank(Z) ≤ n. To that goal, compute UDV′, the singular values decomposition (svd) of (X − 𝕀n

$${\mathbb{I}}_{n}^{\prime }$$
X/n)
$${\tilde{\Sigma }}^{-1}$$
, the scaled covariate matrix; collect the first r columns of UD in Ξ = (UD)·,1:r so that Z γ = [𝕀n Ξ]θ for some θ ∈ ℝr+1. We denote by J(r) a diagonal matrix with
$${J}_{1,1}^{\left(r\right)}=0$$
and
$${J}_{k,k}^{\left(r\right)}=1$$
, k = 2,…,r + 1. It is readily seen that RPLS, run by replacing (X, Σ2) by (Ξ, J(r)), yields an estimate
$${\widehat{\theta }}^{\hbox{ PLS },\kappa }$$
uniquely related to
$${\widehat{\gamma }}^{\hbox{ PLS },\kappa }$$
by the formulas
$\begin{array}{cc}{\widehat{\gamma }}_{2:p+1}={\tilde{\Sigma }}^{-1}V.{,}_{1:r}{\widehat{\theta }}_{2:r+1},& {\widehat{\gamma }}_{1}={\widehat{\theta }}_{1}-{\hbox{ \mathbb{I} }}_{n}^{\prime }X{\widehat{\gamma }}_{2:p+1}/n.\end{array}$
Hence, up to a single svd, the procedure is independent of p which is of computational importance.

Table 1

Comparison of misclassification for Leukemia data (leave-one-out and out-of-sample analyses performed on the Learning/Test set of Golub's subdivision)

p RIRLS RPLS RPCR FPLS MAVE DLDA DQDA KNN
LO OS LO OS LO OS LO OS LO OS LO OS LO OS LO OS
50 0(3) 1(1) 0(2) 4(3) 1(1)
300 0(1) 0(2) 0(2) 2(1) 1(1)
500 0(1) 0(3) 0(2) 0(1) 0(1)
1000 0(2) 0(4) 0(2) 1(1) 0(1)
p RIRLS RPLS RPCR FPLS MAVE DLDA DQDA KNN
LO OS LO OS LO OS LO OS LO OS LO OS LO OS LO OS
50 0(3) 1(1) 0(2) 4(3) 1(1)
300 0(1) 0(2) 0(2) 2(1) 1(1)
500 0(1) 0(3) 0(2) 0(1) 0(1)
1000 0(2) 0(4) 0(2) 1(1) 0(1)

Table 2

Comparison of misclassification for Colon data

p RIRLS RPLS RPCR FPLS MAVE DLDA DQDA KNN
100 9(1) 7(6) 8(1)* 12(1) 17 17 7(5)
500 10 8(3) 8(5) 8(1)* 7(6) 18 22 9(5)
1000 15 7(3) 7(6) 8(1)* 15(1) 20 23 8(7)
pmax 17 7(3) 7(6) 8(1)* 6(4) 22 25 8(7)
p RIRLS RPLS RPCR FPLS MAVE DLDA DQDA KNN
100 9(1) 7(6) 8(1)* 12(1) 17 17 7(5)
500 10 8(3) 8(5) 8(1)* 7(6) 18 22 9(5)
1000 15 7(3) 7(6) 8(1)* 15(1) 20 23 8(7)
pmax 17 7(3) 7(6) 8(1)* 6(4) 22 25 8(7)

Leave-one-out analysis performed on 62 subdivisions of the data set into a learning set (resp. test set) of cardinal 61 (resp. cardinal 1). * means that during the leave-one-out procedure, for a given κ in the range

$${\mathcal{K}}_{c}$$
, some FPLS algorithms did not converge. The optimal value of κ is chosen among the values for which all the FPLS steps converged.

Table 3

Comparison of misclassification for Prostate data

p RIRLS RPLS RPCR FPLS MAVE DLDA DQDA KNN
100 7(3) 6(8) 8(2)* 51(1) 11 11 7(3)
500 10 8(2) 9(6) 8(2)* 7(2) 21 18 8(13)
1000 10 5(3) 5(13) 8(2)* 8(4) 28 24 10(3)
1500 10 7(4) 5(12) 10(2)* 14(4) 31 28 12(3)
p RIRLS RPLS RPCR FPLS MAVE DLDA DQDA KNN
100 7(3) 6(8) 8(2)* 51(1) 11 11 7(3)
500 10 8(2) 9(6) 8(2)* 7(2) 21 18 8(13)
1000 10 5(3) 5(13) 8(2)* 8(4) 28 24 10(3)
1500 10 7(4) 5(12) 10(2)* 14(4) 31 28 12(3)

Leave-one-out analysis performed on 102 subdivisions of the data set into a learning set (resp. test set) of cardinal 101 (resp. cardinal 1). * has the same meaning as in Table 2.

Fig. 1

Resampling analysis for Leukemia data: boxplots of test error rates for classifiers with 50 (white), 300 (light grey), 500 (dark grey) and 1000 (black) genes.

Fig. 1

Resampling analysis for Leukemia data: boxplots of test error rates for classifiers with 50 (white), 300 (light grey), 500 (dark grey) and 1000 (black) genes.

Fig. 2

Resampling analysis for Colon data: boxplots of test error rates for classifiers with 100 (white), 500 (light grey), 1000 (dark grey) and pmax (black) genes (2:1 scheme).

Fig. 2

Resampling analysis for Colon data: boxplots of test error rates for classifiers with 100 (white), 500 (light grey), 1000 (dark grey) and pmax (black) genes (2:1 scheme).

Fig. 3

Resampling analysis for Prostate data: boxplots of test error rates for classifiers with 100 (white), 500 (light grey), 1000 (dark grey) and 1500 (black) genes (2:1 scheme).

Fig. 3

Resampling analysis for Prostate data: boxplots of test error rates for classifiers with 100 (white), 500 (light grey), 1000 (dark grey) and 1500 (black) genes (2:1 scheme).

The authors are really grateful to A. Antoniadis for constructive and fruitful discussions and to the referees for their constructive comments and criticisms which have substantially improved this article. They would like also thank I. De Feis for helpful comments and B. Ding and R. Gentleman for providing a preprint of their paper prior to publication.

Part of this work was supported by the research project ASBGEN and the Interuniversity Attraction Pole (IAP) research network in Statistics P5/24.

## REFERENCES

Albert, A. and Anderson, J.
1984
On the existence of maximum likelihood estimates in logistic regression models.
Biometrika

71
1
–10
Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.
1999
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.

96
6745
–6750
Antoniadis, A., Lambert-Lacroix, S., Leblanc, F.
2003
Effective dimension reduction methods for tumor classification using gene expression data.
Bioinformatics

19
563
–570
Devroye, L., Gyorfi, L., Lugosi, G.
Theory of Pattern Recognition

1996
, New York Springer-Verlag
Ding, B. and Gentleman, R.
2004
Classification using generalized partial least squares. Technical Report 5, Bioconductor Project Working Papers. http://www.bepress.com/bioconductor/paper5
Dudoit, S., Fridlyand, J., Speed, T.
2002
Comparison of discrimination methods for the classification of tumors using gene expression data.
J. Amer. Stat. Assoc.

97
, pp.
77
–87
Eilers, P., Boer, J., Van Ommen, G., Van Houwelingen, H.
2001
Classification of microarray data with penalized logistic regression.
Proceedings of SPIE. Progress in Biomedical Optics and Images
vol.
4266
, pp.
187
–198
Fahrmeir, L. and Tutz, G.
Multivariate Statistical Modelling Based on Generalized Linear Models

2001
2nd edn , New York Springer Series in Statistics
Frank, I. and Friedman, J.
1993
A statistical view of some chemometrics regression tools, with discussion.
Technometrics

35
, pp.
109
–148
Furey, T., Cristianini, N., Duffy, N., Bednarsky, D., Schummer, M., Haussler, D.
2000
Support vector machine classification and validation of cancer tissue samples using microarray expression data.
Bioinformatics

16
906
–914
Ghosh, D.
2002
Singular value decomposition regression models for classification of tumors from microarray experiments.
Pac. Symp. Biocomput.

98
18
–29
Ghosh, D.
2003
Penalized discriminant methods for the classification of tumors from gene expression data.
Biometrics

59
992
–1000
Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E.
1999
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science

286
531
–537
Green, P.
1984
Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives.
J.R. Statist.Soc. B

46
149
–192
Helland, I.
1988
On the structure of partial least squares regression.
Commun. Statist., Simulation Comput.

17
581
–607
Huang, X. and Pan, W.
2003
Linear regression and two-class classification with gene expression data.
Bioinformatics

19
2072
–2078
Kass, R. and Raftery, A.
1995
Bayes factor.
J. Amer. Stat. Assoc.

90
733
–795
Le Cessie, S. and Van Houwelingen, J.
1992
Ridge estimators in logistic regression.
J. R. Statist. Soc. C

41
191
–201
Marx, B.D.
1996
Iteratively reweighted partial least squares estimation for generalized linear regression.
Technometrics

38
374
–381
Massy, W.F.
1965
Principal components regression in exploratory statistical research.
J. Amer. Statist. Assoc.

60
234
–246
Naes, T. and Martens, H.
1985
Comparison of prediction methods for multicollinear data.
Commun. Statist. Simulation Comput.

14
545
–576
Nguyen, D. and Rocke, D.
2002
Tumor classification by partial least squares using microarray gene expression data.
Bioinformatics

18
39
–50
Santner, T. and Duffy, D.
1986
A note on A. Albert and J.A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models.
Biometrika

73
755
–758
Singh, D., Febbo, P., Ross, D., Jackson, G., Manola, J., Ladd, C., Tamayo, A., Renshaw, A., D'Amico, A.V., Richie, J., et al.
2002
Gene expression correlates of clinical prostate cancer behavior.
Cancer Cell

1
203
–209
Timm, N.
2002
Applied Multivariate Analysis. , New York Springer-Verlag
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J., Marks, J., Nevins, J.
2001
Predicting the clinical status of human breast cancer by using gene expression profiles.

98
11462
–11467
Wold, H.
1975
Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach. In Gani, J. (Ed.).
Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett