Anne-Laure Boulesteix, Korbinian Strimmer, Partial least squares: a versatile tool for the analysis of high-dimensional genomic data, Briefings in Bioinformatics, Volume 8, Issue 1, January 2007, Pages 32–44, https://doi.org/10.1093/bib/bbl016
Abstract
Partial least squares (PLS) is an efficient statistical regression technique that is highly suited for the analysis of genomic and proteomic data. In this article, we review both the theory underlying PLS and a host of bioinformatics applications of PLS. In particular, we provide a systematic comparison of the PLS approaches currently employed, and discuss analysis problems as diverse as tumor classification from transcriptome data, identification of relevant genes, survival analysis, and the modeling of gene networks and transcription factor activities.
INTRODUCTION
In the last few years, multivariate statistical methods for the analysis of high-dimensional genomic data have been the subject of numerous publications in statistics, machine learning, bioinformatics and biology. A challenging feature of these data is that they typically contain many more variables (p, e.g. genes or features) than observations (n, e.g. gene chips or time points). For instance, it is not uncommon to collect expression data for 20 000 genes using only 10–20 microarrays. Since many traditional multivariate methods are not applicable in this case, predicting, e.g. the survival time or the tumor class of a patient from such high-dimensional data is a challenging task that requires special techniques such as variable selection or dimension reduction.
In this article, we survey the application of partial least squares (PLS), a powerful yet comparatively little-known approach for analyzing high-dimensional data, to problems in bioinformatics and genomics. The PLS method was first developed by Herman Wold in the 1960s and 1970s to address problems in econometric path modeling, and was subsequently adopted by his son Svante Wold (and many others) in the 1980s for regression problems in chemometric and spectrometric modeling. Early references on path modeling are, e.g. Wold [5, 1–3]. One of the first applications of PLS to regression is Wold et al. [4]. Two recent studies [6] describe these early developments and provide a detailed chronological overview. PLS is still a highly active research area from a theoretical point of view; see for instance [7] for recent developments on the connections of PLS with Krylov subspaces and conjugate gradients. PLS started to attract the attention of statisticians only about 15 years ago; see e.g. [8–11]. This interest was mainly due to the ability of PLS to work very well for data with very small sample sizes and a large number of parameters. It is thus only natural that in the last few years this methodology has been successfully applied to problems in genomics and proteomics.
PLS methods are in general characterized by high computational and statistical efficiency. They also offer great flexibility and versatility in terms of the analysis problems that may be addressed. However, the PLS literature is very diverse because of the large number of algorithmic variants of PLS, which makes it difficult to grasp the principles underlying the approach. It is the aim of this article to fill this gap by, firstly, providing a systematic overview of the available PLS methods and, secondly, reviewing the broad range of their applications to genomic data.
The remainder of the article is structured as follows. In ‘Methodological Foundations of Partial Least Squares’ section, we summarize the main methodological aspects of PLS regression. In ‘Applications of Partial Least Squares to High-dimensional Genomic Data’ section, various applications of PLS regression to microarray studies are reviewed. ‘Outlook and Generalizations of PLS’ section is devoted to PLS-based methods that are especially designed for particular types of response variables (for instance, survival time or categorical outcome) and to their practical use in microarray data analysis. A recapitulation of the notations and abbreviations that are used throughout the manuscript can be found in the appendix.
METHODOLOGICAL FOUNDATIONS OF PARTIAL LEAST SQUARES
In this section, we provide an introduction to the mathematics of PLS. In a nutshell, PLS is a dimension reduction approach that is coupled with a regression model. Unlike related approaches such as principal component regression, PLS chooses the latent components with the response variable of the regression in mind.
PLS regression
Suppose we are given a sample of n observations $(\mathbf{x}'_i, \mathbf{y}'_i)$, $i = 1, \dots, n$, where $\mathbf{x}'_i = (x'_{i1}, \dots, x'_{ip})^T$ and $\mathbf{y}'_i = (y'_{i1}, \dots, y'_{iq})^T$ denote the ith observation of the predictor and response variables, respectively. The prime denotes uncentered data, as in [9]. Its removal indicates the subtraction of the sample average, i.e. $\mathbf{x}_i = \mathbf{x}'_i - \bar{\mathbf{x}}'$ and $\mathbf{y}_i = \mathbf{y}'_i - \bar{\mathbf{y}}'$. The centered observations are collected row-wise in the n × p matrix X and the n × q matrix Y. PLS is based on the latent decomposition
$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E}, \qquad \mathbf{Y} = \mathbf{T}\mathbf{Q}^T + \mathbf{F},$$
where T = XW is the n × c matrix of latent components, W is the p × c matrix of weights, P and Q are the p × c and q × c matrices of loadings, and E and F are error matrices. In PLS, dimension reduction and regression are performed simultaneously, i.e. PLS outputs the matrix of regression coefficients B as well as the matrices W, T, P and Q, hence the term PLS regression. In the PLS literature, the columns of T are often denoted as ‘latent variables’ or ‘scores’. In this study, we prefer the term ‘latent components’, since in PLS the columns of T are rather the result of a matrix decomposition than observations of underlying random variables. P and Q are often denoted as ‘X-loadings’ and ‘Y-loadings’, respectively.
The basic idea of the PLS method is that the response Y should be taken into account in the construction of the components T. More precisely, the components are defined such that they have high covariance with the response, as outlined in the ‘Univariate response’ and ‘Multivariate response’ sections. That is why PLS is called a supervised method, in contrast to, e.g. principal component analysis (PCA), which does not use the response for the construction of the new components. This feature explains why PLS usually performs better than PCA in prediction problems.
The characterization of the various PLS regression approaches might be done at four different levels:
the objective function maximized by the W matrix,
the W matrix itself,
the obtained matrix of regression coefficients B and
the algorithm used to compute W.
These four different levels are connected as follows:
The same W matrix can maximize several objective functions. But a given objective function is generally satisfied by only one W matrix (and its opposite, −W).
There might be several algorithms that output the same W matrix.
A given W matrix leads to only one possible matrix of regression coefficients. But two different matrices W and W* can lead to the same regression coefficients if there exists an invertible c × c matrix M such that W* = WM. Note that, although W and W* lead to the same prediction, they do not necessarily satisfy the same objective function.
Univariate response
In PLS univariate regression, there is only one commonly adopted objective function. The columns w1, …, wc of the p × c weight matrix W are defined such that the squared sample covariance between Y and the latent components is maximal under the condition that the latent components are mutually empirically uncorrelated. Moreover, the vectors w1, …, wc are constrained to be of unit length.
Objective function 1: Univariate PLS (PLS1)
For i = 1, …, c,
$$\mathbf{w}_i = \arg\max_{\mathbf{w}} \widehat{\mathrm{Cov}}(\mathbf{X}\mathbf{w}, \mathbf{Y})^2,$$
subject to $\mathbf{w}_i^T\mathbf{w}_i = 1$ and $\mathbf{t}_i^T\mathbf{t}_j = 0$ for j = 1, …, i − 1, where c is the number of latent components fixed by the user. The maximal number of such latent components that have non-zero covariance with Y is cmax = min(n − 1, p). The weight vectors w1, …, wc can be computed sequentially via a simple and fast non-iterative algorithm, given e.g. in [12] and denoted as the ‘algorithm with orthogonal scores’ because the matrix T^T T is diagonal. Martens and Naes [12] also give another algorithm, denoted as the ‘algorithm with orthogonal loadings’, which outputs a different W matrix. Using this algorithm, one obtains orthogonal loadings instead of orthogonal latent components (P^T P is diagonal but not T^T T). It can be shown [8] that the resulting regression coefficients in the matrix B are the same for both algorithms. Since orthogonal latent components are easier to interpret than orthogonal loadings, the first algorithm is almost always preferred in the literature. Some statistical aspects of PLS1 regression are discussed, e.g. in [9–11]. From a practical point of view, the objective function of PLS1 can be interpreted as follows: the components constructed by PLS1 have maximal covariance with the response and thus high predictive power; moreover, they are not redundant, since they are mutually uncorrelated. The case of a multivariate response (q > 1) is presented in the following section.
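To make the construction concrete, the following is a minimal NumPy sketch of the PLS1 ‘algorithm with orthogonal scores’ (an illustrative reimplementation with our own naming, not the code of [12]; in practice one would rely on an established implementation such as the R packages listed in the ‘Available software’ section).

```python
import numpy as np

def pls1(X, y, c):
    """PLS1 with orthogonal scores: X is n x p, y has length n, c components.

    Returns the weight matrix W (p x c), the latent components T (n x c)
    and the regression coefficients B (length p) for centered data.
    """
    Xc = X - X.mean(axis=0)          # center predictors
    yc = y - y.mean()                # center response
    Xk = Xc.copy()
    W, T, P, q = [], [], [], []
    for _ in range(c):
        w = Xk.T @ yc                # proportional to the covariances cov(X_j, Y)
        w /= np.linalg.norm(w)       # unit-length weight vector
        t = Xk @ w                   # latent component
        tt = float(t @ t)
        p_vec = Xk.T @ t / tt        # X-loadings
        q.append(float(yc @ t) / tt) # Y-loading
        Xk = Xk - np.outer(t, p_vec) # deflation keeps the components uncorrelated
        W.append(w); T.append(t); P.append(p_vec)
    W, T, P = map(np.column_stack, (W, T, P))
    q = np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)   # coefficients for the original predictors
    return W, T, B

# toy usage: 20 "arrays", 100 "genes", 3 latent components
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=20)
W, T, B = pls1(X, y, c=3)
y_hat = (X - X.mean(axis=0)) @ B + y.mean()
```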
Multivariate response
The case of a multivariate response is more difficult to handle, since one has to find latent components which explain all the responses Y1, …, Yq simultaneously. There are two main variants of multivariate PLS regression. The first variant is usually denoted as PLS2, in contrast to the univariate method PLS1, or simply as PLS. To avoid misunderstandings, we use the term PLS2. The W matrix corresponding to PLS2 may be obtained via several algorithms. The most well-known are the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm and the Kernel-PLS algorithm, which are implemented in the R packages pls and pls.pcr. Recently, ter Braak and de Jong [13] showed that PLS2 maximizes the same expression as Statistically Inspired Modification of PLS (SIMPLS), but with different and less intuitive constraints.
Objective function 2: PLS2
For i = 1, …, c,
$$\mathbf{w}_i = \arg\max_{\mathbf{w}} \sum_{k=1}^{q} \widehat{\mathrm{Cov}}(\mathbf{X}\mathbf{w}, \mathbf{Y}_k)^2,$$
subject to $\mathbf{w}_i^T(\mathbf{I}_p - \mathbf{W}\mathbf{W}^{+})\mathbf{w}_i = 1$ and $\mathbf{t}_i^T\mathbf{t}_j = 0$ for j = 1, …, i − 1, where Ip denotes the p × p identity matrix, W = [w1, …, wi−1] collects the previously computed weight vectors and W+ is the unique Moore–Penrose inverse of W.
The second important variant of multivariate PLS regression is SIMPLS, which was first introduced by de Jong [14]. In contrast to PLS2, SIMPLS was first formulated as an optimality problem; algorithms were then developed to solve this optimality problem.
Objective function 3: SIMPLS
For i = 1, …, c,
$$\mathbf{w}_i = \arg\max_{\mathbf{w}} \sum_{k=1}^{q} \widehat{\mathrm{Cov}}(\mathbf{X}\mathbf{w}, \mathbf{Y}_k)^2,$$
subject to $\mathbf{w}_i^T\mathbf{w}_i = 1$ and $\mathbf{t}_i^T\mathbf{t}_j = 0$ for j = 1, …, i − 1.
Objective function 4: SIMPLS (equivalent formulation)
For i = 1, …, c,
$$\mathbf{w}_i = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{X}\mathbf{w},$$
subject to the same constraints as in objective function 3.
As for PLS2, there exist several algorithms that solve the optimality problem of SIMPLS. One of them is implemented in the function simpls from the R package pls.pcr. A particularity of the R function simpls is that it returns unit length scores instead of unit length weights (as one would expect when considering objective function 3). By transforming the weights to have unit length, one obtains weights satisfying objective function 3. A user-friendly version of SIMPLS implementing this transformation can be found in the R package plsgenomics [16].
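For completeness, here is an analogous NumPy sketch of SIMPLS for a possibly multivariate response (again an illustrative reimplementation with our own naming rather than the code of [14] or of the R packages mentioned above; note that, like the R function simpls, it normalizes the scores rather than the weights).

```python
import numpy as np

def simpls(X, Y, c):
    """SIMPLS: X is n x p, Y is n x q, c latent components.

    Returns the weights R (p x c), unit-length scores T (n x c) and the
    p x q matrix of regression coefficients B for centered data.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n, p = Xc.shape
    q = Yc.shape[1]
    S = Xc.T @ Yc                            # p x q cross-product matrix
    R = np.zeros((p, c)); T = np.zeros((n, c))
    Q = np.zeros((q, c)); V = np.zeros((p, c))
    for i in range(c):
        u, s, vt = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]                          # direction maximizing sum_k cov^2(Xw, Y_k)
        t = Xc @ r
        norm_t = np.linalg.norm(t)
        t, r = t / norm_t, r / norm_t        # unit-length score
        p_load = Xc.T @ t                    # X-loading
        Q[:, i] = Yc.T @ t                   # Y-loading
        v = p_load - V[:, :i] @ (V[:, :i].T @ p_load)
        v /= np.linalg.norm(v)               # orthonormalized loading used for deflation
        S = S - v[:, None] @ (v[None, :] @ S)
        R[:, i], T[:, i], V[:, i] = r, t, v
    B = R @ Q.T                              # regression coefficients (centered data)
    return R, T, B
```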
APPLICATIONS OF PARTIAL LEAST SQUARES TO HIGH-DIMENSIONAL GENOMIC DATA
Regression problems
Any genomic analysis that incorporates a regression model may profit from the application of PLS. Some important recent examples are briefly reviewed in this section.
A straightforward application of univariate PLS regression to expression data from the yeast Saccharomyces cerevisiae can be found in [17]. In this study, some handpicked gene expression levels are regressed against the expression levels of other genes using PLS1 with different numbers of latent components. The magnitudes of the obtained regression coefficients are interpreted in terms of interaction strength between genes.
PLS regression has also been successfully applied to missing-value imputation in microarray data by Bras and Menezes [18]. In this approach, the missing values are imputed by PLS regression using all the genes with observed values as predictors. Another reference on PLS imputation in the context of microarray data is Nguyen et al. [19].
Huang et al. [20] use PLS regression for prediction purposes. The aim is to model a continuous variable (LVAD support time) using p gene expression levels as predictors. LVAD stands for ‘left ventricular assist device’, a mechanical substitution therapy for heart failure patients awaiting transplantation. Although PLS regression can handle a very large number of predictors and can thus be applied to this problem without adaptation, Huang et al. [20] suggest a penalized version of PLS regression (PPLS), which eliminates genes with poor prediction power. Their method is based on the shrinkage of the p regression coefficients obtained by PLS regression. After the shrinkage procedure, a number of genes (depending on the shrinkage parameter Δ) no longer contribute to the model. Huang et al. [20] suggest using cross-validation to select both the shrinkage parameter Δ and the number c of latent components used to produce the regression coefficients.
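The exact penalty used by Huang et al. [20] is not reproduced here, but the general idea of shrinking PLS1 coefficients and tuning (c, Δ) by cross-validation can be sketched as follows (soft-thresholding is used purely for illustration, the pls1 helper from the sketch in the methods section is reused, and all names are ours).

```python
import numpy as np

def shrink_coefficients(B, delta):
    """Soft-threshold PLS regression coefficients: genes whose coefficient
    magnitude falls below delta drop out of the model entirely."""
    return np.sign(B) * np.maximum(np.abs(B) - delta, 0.0)

def cv_error(X, y, c, delta, n_splits=5, seed=0):
    """Cross-validated squared prediction error for a given (c, delta) pair."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_splits)
    err = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        W, T, B = pls1(X[train], y[train], c)        # PLS1 sketch from above
        Bs = shrink_coefficients(B, delta)
        y_hat = (X[test] - X[train].mean(axis=0)) @ Bs + y[train].mean()
        err += np.sum((y[test] - y_hat) ** 2)
    return err / n
```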
PLS regression is used by Johansson et al. [21] to identify periodically expressed genes. Johansson et al. [21] construct a virtual response Y that represents cyclic behavior with the same periodicity as the cell cycle. The genes that contribute significantly to the PLS regression model are then interpreted as cell-cycle regulated.
Applications of multivariate PLS regression to other types of data include the prediction of transcription factor activities from a combined analysis of gene expression data and chromatin immunoprecipitation (ChIP) data, as proposed by Boulesteix and Strimmer [16]. The transcription of genes is regulated by DNA-binding proteins known as transcription factors. An issue of interest for biologists is the estimation of the activity levels of these transcription factors. The available data include microarray data for the potential target genes under different experimental conditions, and ‘connectivity’ data (e.g. ChIP data) giving the amount of interaction between the transcription factors and the considered genes. Boulesteix and Strimmer [16] assume a linear relationship between microarray data and connectivity data of the form Y = A + XB + F, where Y is the n × q matrix containing the expression levels of n genes (rows) in q conditions (columns), X is the n × p matrix containing the connectivity information for n genes (rows) and p transcription factors (columns), A is an n × q matrix of intercepts and F is an n × q error matrix. The p × q matrix B corresponds to the activity levels of the p transcription factors in the q considered conditions. Thus, the estimation of the transcription factor activities can be formulated as a simple regression problem, which is solved in [16] by employing the SIMPLS method. Using PLS in this context allows one not only to extract information on the transcription factor activities but also to identify coherent ‘meta-factors’ corresponding to the different latent components.
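As a toy illustration of this formulation (simulated data rather than real expression/ChIP measurements; the simpls sketch given in the methods section is reused and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 200, 15, 6                 # n genes, p transcription factors, q conditions
X = rng.normal(size=(n, p))          # connectivity data (e.g. ChIP) for n genes
B_true = rng.normal(size=(p, q))     # unknown transcription factor activities
Y = X @ B_true + rng.normal(scale=0.5, size=(n, q))   # expression in q conditions

R, T, B_hat = simpls(X, Y, c=5)      # SIMPLS estimate of the p x q activity matrix
# The columns of T can be read as 'meta-factors': latent combinations of the
# transcription factors that jointly explain expression across the q conditions.
```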
Other applications of PLS to regression problems in genomic data analysis include, e.g. the prediction of protein structure (such as helix or strand content) from high-dimensional sequence data [22].
Classification problems
In classification problems, the response is a categorical variable that indicates the class membership of each observation. To apply PLS regression in this setting, the categorical response is usually recoded as a matrix of dummy variables whose columns represent the different classes. Using this transformation, it can be shown that multivariate PLS dimension reduction (almost) leads to the same components as PCA performed on the between-group sample covariance matrix. A collection of properties on this topic as well as mathematical proofs are given in [23]. These properties can be seen as a justification of PLS dimension reduction with categorical variables. Recently, many researchers have considered PLS methods for classification:
In two independent comparative studies by Man et al. [24] and Huang et al. [25], classification based on PLS regression is reported to lead to high prediction accuracy.
PLS classification analysis for a binary response has been investigated by Huang and Pan [26] for leukemia [27] and colon cancer data [28]. Each observation is assigned to one of the two classes 0 or 1, depending on the continuous prediction. Huang and Pan [26] suggest determining the best number of latent components by leave-one-out cross-validation.
A similar approach is used in a more applied study by Perez-Enciso and Tenenhaus [29]: various binary outcomes such as (i) before versus after chemotherapy treatment in a case-control study, (ii) estrogen receptor positive versus negative tumors and (iii) tumor type are predicted via PLS discriminant analysis.
PLS regression is also employed for multiclass classification in [30] for the molecular diagnosis of cancer. Using the software SIMCA, the authors perform classification on the National Cancer Institute (NCI) data set [31], which gives the expression levels of 9605 genes in 60 tumor cell lines of eight different types (leukemia, non-small-cell lung, colon, melanoma, ovarian, breast, central nervous system and renal).
Other classification studies based on PLS regression can be found in [32–36]. A similar approach based on PLS regression to perform classification in the context of meta-analysis is suggested in [37].
There exists another route to classification using partial least squares, first proposed by Nguyen and Rocke [38, 39] and further studied by Boulesteix [40] and compared with other dimension reduction techniques in [41]. This approach first employs PLS as a dimension reduction method and subsequently uses the PLS latent components as predictors in a classical discrimination method (e.g. logistic regression, linear or quadratic discriminant analysis). To apply this method, one has to choose (i) the number of latent components to be extracted in the dimension reduction step and (ii) the classification method to be used for the classification step.
In Nguyen and Rocke [38, 39], three classification methods are studied: logistic regression, linear discriminant analysis and quadratic discriminant analysis. In [40], the only investigated classification method is linear discriminant analysis. Generally, linear discriminant analysis (LDA) turns out to yield the best classification performance, whereas quadratic discriminant analysis gives worse results. In the extensive comparison study performed by Boulesteix [40], which included many currently employed methods, PLS+LDA ranks among the best classification procedures for all eight studied cancer data sets. According to this study, the most successful other methods are the nearest centroids approach by Tibshirani et al. [42] and support vector machines.
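A minimal sketch of this two-step PLS+LDA scheme, written here with scikit-learn on simulated data (the cited studies used their own R implementations; the library choice, data and parameter values are ours):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2000))          # 60 arrays, 2000 genes
y = rng.integers(0, 2, size=60)          # binary tumor class
X[y == 1, :10] += 1.0                    # make the first 10 genes informative

# Step 1: PLS dimension reduction on the 0/1-coded class labels.
pls = PLSRegression(n_components=3).fit(X, y)
T = pls.transform(X)                     # n x c matrix of latent components
# Step 2: classical discriminant analysis on the latent components.
lda = LinearDiscriminantAnalysis().fit(T, y)
print("training accuracy:", lda.score(T, y))
# In practice both the number of components and the error rate would be
# assessed by cross-validation, refitting the PLS step within each fold.
```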
Feature selection
An issue that is tightly connected with the prediction of a clinical outcome is the identification of genes whose expression levels are associated with the considered outcome. For instance, a physician might want to find out which genes have different expression levels in tumor and normal tissues. The selection of relevant genes is important both for biologists, who aim to understand gene function and cellular processes, and for statisticians, who want to apply statistical methods that can handle only a restricted number of variables.
It can be shown [40] that the F-statistic of gene j is a monotonically increasing function of the squared jth entry of the first weight vector w1 of PLS1 if the columns of the predictor matrix X have been preliminarily scaled to unit variance. Thus, the ordering of the genes obtained from the weight vector w1 is equivalent to the ordering obtained using the F-statistic, which is one of the most common ordering criteria in microarray data analysis. This shows that PLS dimension reduction and variable selection are in fact two tightly related procedures, and it also indicates that PLS methods take more information into account than the usual univariate gene selection procedures, since they often involve more than one latent component. Similar results might also be obtained in the framework of regression.
A gene selection approach based on several PLS latent components is applied to gene expression data by Musumarra et al. [30, 43]. It is based on all the weight vectors w1, …, wc and implemented in the software package SIMCA. The ‘variable influence’ VINγj of gene j for the γ-th PLS component is defined as a function of the weight wγj and the proportion of the sum of squares explained by the γ-th latent component. Finally, the genes are ordered according to their ‘variable importance in the projection’ VIPj, which is defined for each gene j as the sum of the VINγj over the c PLS latent components. An advantage of this approach is that it captures information on each single gene from all the PLS latent components included in the analysis. Thus, it can also discover non-linear patterns which the F-statistic would fail to detect. A major drawback of the VIP index is its lack of theoretical background. One might investigate its connections to the matrix of regression coefficients.
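The VIN/VIP indices computed by SIMCA are not documented in detail here; the following sketch instead uses the commonly cited form of the VIP score, which follows the same idea of weighting each gene's contribution to a component by the share of the response sum of squares that the component explains (the pls1 helper from the methods section is reused, and the exact SIMCA definition may differ).

```python
import numpy as np

def vip_scores(X, y, c):
    """Variable importance in projection (VIP) computed from a PLS1 fit."""
    yc = y - y.mean()
    W, T, B = pls1(X, y, c)                    # W has unit-length columns
    q = T.T @ yc / np.sum(T ** 2, axis=0)      # Y-loadings of the c components
    ssy = q ** 2 * np.sum(T ** 2, axis=0)      # response SS explained per component
    p = X.shape[1]
    vip = np.sqrt(p * ((W ** 2) @ ssy) / ssy.sum())
    return vip                                  # one importance value per gene
```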
Survival analysis
Another issue of interest in the statistical analysis of gene expression data is the prediction of the survival time Y of diseased patients using their gene expression profiles. In this context, survival data are usually denoted as a triple (t, δ, x), where:
t is a continuous variable usually called failure time which equals the time to death Y if δ = 1 or the time to censoring if δ = 0,
δ is a binary variable, which equals 1 if the death of the patient was observed before censoring and 0 if the patient was still alive at the end of the study,
x = (X1, …, Xp)T is a vector of p continuous gene expression levels which are considered as predictor variables.
Standard approaches for predicting survival times from continuous predictors, such as the proportional hazards regression model (PH model) by Cox [44], cannot be applied directly if n < p. Various approaches based on the clustering of genes or observations have been proposed, with the drawback that the results depend on the chosen clustering algorithm. PLS-based survival analysis is another important family of methods for survival analysis with many predictors.
Nguyen and Rocke [45] suggest a two-stage method that (i) performs univariate PLS with the failure time as response variable and X1, …, Xp as predictors and (ii) uses the obtained first latent components as predictors in classical PH regression. They apply their approach to lymphoma data [46], giving the survival time and expression levels of 5622 genes for 40 lymphoma patients, and to breast cancer data [47], giving the survival time and expression levels of 3846 genes for 49 breast cancer patients. In this two-step procedure, dimension reduction and prediction using PH regression are performed successively. The specificity of the failure time is not taken into account during dimension reduction: time to death and time to censoring are treated as the same continuous variable, which is a severe drawback if censoring is non-negligible. Improvements of this approach are proposed in [48–50]. These approaches combine the construction of the successive PLS latent components with PH regression, but in different ways. They are reviewed in the ‘Outlook and Generalizations of PLS’ section, which deals with PLS-based methods for special response variables.
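A sketch of this two-stage scheme on simulated data (the lifelines package is used here for the Cox model purely for convenience; the original study used other software, and, as discussed above, censoring is simply ignored in the dimension reduction stage):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n, p, c = 50, 3000, 2
X = rng.normal(size=(n, p))                         # gene expression levels
time = rng.exponential(scale=np.exp(-X[:, 0]))      # failure or censoring times
event = rng.integers(0, 2, size=n)                  # 1 = death observed, 0 = censored

# Stage (i): PLS1 on the failure time itself (censoring status ignored here,
# which is exactly the drawback discussed in the text).
W, T, B = pls1(X, time, c)                          # pls1 sketch from the methods section

# Stage (ii): proportional hazards regression on the latent components.
df = pd.DataFrame(T, columns=[f"t{i + 1}" for i in range(c)])
df["time"], df["event"] = time, event
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()
```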
Available software
There are currently four R packages that implement partial least squares approaches:
plsgenomics
(http://cran.r-project.org/src/contrib/Descriptions/plsgenomics.html)
This package implements PLS regression (using the function simpls from the pls.pcr package) with user-friendly features such as the choice of the number of components. It also implements the classification method PLS+LDA presented in the ‘Classification problems’ section and discussed by Nguyen and Rocke [38, 39] and Boulesteix [40], as well as the ridge PLS method [51] mentioned in the ‘PLS and generalized linear models’ section.
pls.pcr
(http://cran.r-project.org/src/contrib/Descriptions/pls.pcr.html)
This package implements the two main variants of multivariate PLS regression SIMPLS and PLS2 as well as PCR.
pls
(http://cran.r-project.org/src/contrib/Descriptions/pls.html)
This package is an extension of the earlier package pls.pcr including, e.g. various plot functions and a formula interface.
gpls
(http://cran.r-project.org/src/contrib/Descriptions/gpls.html)
This package implements the classification method using generalized PLS [52] mentioned in the ‘PLS and generalized linear models’ section.
plss
(http://www.math.univ-montp2.fr/~durand/ProgramSources.html)
These programs implement PLS regression based on spline transformations of the predictors [53]. They work only under R for Windows.
Other software
Classification with PLS regression (PLS-DA, where DA stands for discriminant analysis) is implemented in the software tool SIMCA.
(http://www.umetrics.com/default.asp/pagename/software_simcap/c/3/).
The SAS procedure PLS implements several dimension reduction methods such as PCR, Reduced Rank Regression (RRR) and PLS. The two main versions of multivariate PLS (SIMPLS and PLS2) are available. For PLS2, one may specify the algorithmic variant as an option, for instance NIPALS.
The PLS Toolbox (by Eigenvector Research Incorporated) for use with MATLAB
(http://software.eigenvector.com/toolbox/3_5/index.html)
includes a wide range of methods for multivariate statistical analysis, some of which are based on PLS regression. In particular, it includes the function plsda, which performs classification (class prediction) based on SIMPLS or PLS2 regression.
The software tool Unscrambler
(http://www.camo.com/rt/Products/Unscrambler/unscrambler.html)
also implements univariate (PLS1) and multivariate (PLS2) PLS regression as well as PLS-DA.
OUTLOOK AND GENERALIZATIONS OF PLS
So far, we have considered applications of PLS regression to various biological problems. However, applying a regression method designed for continuous responses to categorical responses, or performing dimension reduction on survival data without taking censoring into account, is unappealing, although it is reported to give good results in many cases. In this section, we review methods that use the principle of PLS regression but adapt it to handle special types of responses such as survival times or categorical outcomes. These methods can be divided into two categories. In the first category, the structure of the univariate PLS regression algorithm remains unchanged, but the coefficients used to construct the latent components are modified. In the second category, the PLS algorithm is embedded into a more complex generalized regression procedure. Both approaches can be applied to, e.g. survival analysis and classification. In the following, we consider only the univariate case, i.e. Y is an n × 1 matrix (an n-vector).
Modification of the latent components in PLS regression
Recalling that the first PLS1 latent component is t1 = Xw1 with w1 ∝ X^T Y, and with xij denoting the element of X at row i and column j, simple transformations lead to
$$t_{i1} \propto \sum_{j=1}^{p} \hat{\beta}_j \,\widehat{\mathrm{Var}}(X_j)\, x_{ij}$$
(or simply $t_{i1} \propto \sum_{j=1}^{p} \hat{\beta}_j x_{ij}$ if the predictors are scaled to unit variance), where βj is the least squares regression coefficient obtained by regressing Y against Xj. The subsequent vectors t2, …, tc may be expressed in a similar way using deflated matrices. Several studies are based on the idea that βj is not an optimal choice when Y is a binary or survival variable. Li and Gui [50] suggest replacing βj by the regression coefficient of Xj obtained via Cox regression analysis, thus taking the specificity of the response variable Y into account. For the construction of t1, Y is regressed against Xj alone. For the construction of tj, j > 1, Y is regressed against Xj and the j − 1 first latent components. A similar approach is proposed by Bastien [54] and studied from a methodological point of view in [55]. The idea of replacing a linear regression coefficient by a Cox regression coefficient also inspired another method, denoted as ‘MPLS’: Nguyen [48] gives a different, non-sequential expression of the PLS1 latent components t1, …, tc involving eigenvectors of the matrices X^T X and X X^T (see [56] for details). This complex expression also contains a linear regression coefficient, which Nguyen [48] replaces by a Cox regression coefficient. The same approach is also used in the context of binary classification [56] and denoted as ‘PLSM2’.
A related approach, denoted as PLS logistic regression, is used in [57] to map complex trait genes using gene expression data. In this setting, the response is a categorical genetic trait and the latent components t2, …, tc are constructed based on the regression coefficients estimated from a logistic regression model. Perez-Enciso et al. [57] demonstrate the potential of this approach in an extensive simulation study.
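The following sketch illustrates the substitution for the first latent component only (a rough illustration: the full procedures of Li and Gui [50] and Bastien [54] include further normalization and deflation steps not reproduced here; univariate Cox coefficients are obtained with the lifelines package purely for convenience, and all names are ours).

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def first_cox_pls_component(X, time, event):
    """First latent component built from univariate Cox regression coefficients.

    Instead of the least-squares coefficient beta_j of gene j, each gene
    contributes its univariate Cox coefficient, so that censoring is taken
    into account when the component is constructed. The loop over genes is
    slow for thousands of genes and is kept only for clarity.
    """
    n, p = X.shape
    beta_cox = np.zeros(p)
    for j in range(p):
        df = pd.DataFrame({"x": X[:, j], "time": time, "event": event})
        fit = CoxPHFitter().fit(df, duration_col="time", event_col="event")
        beta_cox[j] = fit.params_["x"]
    w1 = beta_cox / np.linalg.norm(beta_cox)     # unit-length weight vector
    t1 = (X - X.mean(axis=0)) @ w1               # first latent component
    return w1, t1
```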
PLS and generalized linear models
Marx [58] proposes an extension of the concept of PLS regression into the framework of generalized linear models. This approach, which is denoted as iteratively reweighted partial least squares (IRPLS or IRWPLS), embeds the univariate PLS regression algorithm into the iterative steps of the usual Iteratively Reweighted Least Squares algorithm [59] for generalized linear models, resulting in two nested loops. The loops are iterated a fixed number of times or until a convergence criterion is reached. This apparently appealing approach has a major drawback in practical microarray data analysis: convergence is never reached if X is full row-rank, which is most often the case in high-dimensional microarray data with n ≪ p [51]. The IRPLS method as well as a few adaptations overcoming the convergence problem have been applied both to survival analysis and classification. Binary classification is one of the most common applications of generalized linear models and of Marx's IRPLS algorithm. To our knowledge, the IRPLS algorithm has never been applied directly to classification with microarray data. However, it has inspired at least two recent papers on the generalization of PLS regression to categorical response variables.
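Before turning to these two adaptations, the basic structure of an IRPLS iteration for binary logistic regression can be sketched as follows (a simplified illustration only: the weighted PLS step is approximated by rescaling the working response and the predictors with the square-root IRLS weights, no remedy for the separation and convergence problems discussed above is included, and the pls1 helper from the methods section is reused).

```python
import numpy as np

def irwpls_logistic(X, y, c, n_iter=25, tol=1e-6):
    """Simplified IRPLS sketch for a binary response y coded 0/1."""
    Xc = X - X.mean(axis=0)
    eta = np.zeros(len(y), dtype=float)             # current linear predictor
    B = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-np.clip(eta, -30, 30)))
        v = np.clip(mu * (1.0 - mu), 1e-10, None)   # IRLS weights (guarded)
        z = eta + (y - mu) / v                      # working response
        sw = np.sqrt(v)
        # the weighted least-squares step is replaced by a (row-rescaled) PLS1 fit
        W, T, B = pls1(sw[:, None] * Xc, sw * z, c)
        eta_new = Xc @ B
        if np.max(np.abs(eta_new - eta)) < tol:
            eta = eta_new
            break
        eta = eta_new
    return B, eta
```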
The first approach is proposed by Ding and Gentleman [52] and can be seen as an adaptation of Marx's IRPLS method which solves the problem of separation. As already mentioned in the ‘Classification problems’ section, infinite parameter estimates can occur in binary logistic regression when the two classes are completely or quasi-completely separated [60]. Firth [61] suggests a procedure to remove the first-order term of the asymptotic bias of maximum likelihood estimates in generalized linear models (GLMs). The procedure is based on a modified score function which, when applied to logistic regression, guarantees finite estimates [62]. The binary classification method obtained by using Firth's modified score function in place of the usual score function in the IRPLS algorithm is denoted as IRWPLSF by Ding and Gentleman [52]. They also propose a generalization of the method to multicategorical response variables, which is based on the multinomial logit model and denoted as MIRWPLSF. IRWPLSF and MIRWPLSF are reported to achieve slightly better classification performance than usual classification methods such as nearest neighbors or SVM on the colon cancer data [28] and on the NCI cancer data [31]. The second approach to modifying Marx's IRPLS is suggested in [51]: the procedure embeds a PLS step into ridge-penalized logistic regression and might also be generalized to multicategorical responses. This method is applied with success to the colon cancer data [28], the leukemia data [27] and the prostate cancer data [63].
Another classical application of generalized linear models and IRPLS is survival analysis. As suggested in [64], Park et al. [49] transform the failure time problem into a generalized linear regression problem with a logarithmic link function. They propose to use the IRPLS estimation method for generalized linear regression [58]. In contrast to the two-stage scheme developed in [45], this method takes censoring explicitly into account. The number of components is chosen via a cross-validation procedure, which suggests c = 1 for the lung cancer data set [65]. According to Park et al. [49], convergence is achieved in a few steps. However, this claim seems to be controversial: lack of convergence is invoked as a drawback of the method in the more recent paper by Li and Gui [50].
CONCLUSIONS
The microarray ‘revolution’ has led to an enormous increase in the availability of high-dimensional biomedical data. Classical multivariate methods are not applicable to these ‘small n, large p’ data sets. In this article we have reviewed the PLS approach to regression and dimension reduction, which is well suited for analysing this kind of data.
Specifically, PLS has several advantages over many competing approaches:
It does not require preliminary variable selection and handles a very large number of (possibly correlated) predictors.
It can be applied to a diverse set of tasks, including classification, survival analysis and modeling of transcription factors activities.
It is statistically very efficient.
Moreover, it is computationally very fast, which renders it practical for application to large data sets.
As outlined in the ‘Applications of Partial Least Squares to High-dimensional Genomic Data’ and ‘Outlook and Generalizations of PLS’ sections of this review, at present most reported applications of the PLS method to genomic data focus on the analysis of microarray data from gene expression experiments. The key advantages that characterize the PLS methodology are versatility and flexibility. On the one hand, it can be directly applied to various types of data of any dimension for different prediction or imputation problems. On the other hand, PLS algorithms adapt easily to a broad range of questions and thus serve as a flexible basis for the development of novel tools for the analysis of biological data. In short, we expect that with the advent of proteomics data, e.g. from mass spectrometric experiments, PLS will in the future also play a major role in the analysis of many other kinds of high-dimensional omics data.
PLS is an efficient statistical prediction tool that is especially appropriate for small sample data with many (possibly correlated) variables.
PLS is fast, easy to implement and does not necessitate any preliminary feature selection.
The problems that may be addressed by the PLS method are very diverse and include, e.g. tumor diagnosis, survival analysis and the modeling of regulatory networks.
References
APPENDIX
List of abbreviations
| Term | Meaning | Introduced in section |
|---|---|---|
| PLS1 | Univariate PLS | Univariate response |
| PLS2 | Multivariate PLS (first variant) | Multivariate response |
| SIMPLS | Multivariate PLS (second variant) | Multivariate response |
| OLS | Ordinary Least Squares | |
| PCR | Principal Component Regression | |
| PCA | Principal Component Analysis | |
| RRR | Reduced Rank Regression | |
| PLS+LDA | Two-step classification procedure consisting of PLS dimension reduction and LDA | Classification problems |
| IRPLS | Marx's Iteratively Reweighted PLS | PLS and generalized linear models |
| X = (xij), i = 1, …, n, j = 1, …, p | n × p matrix of predictors | PLS regression |
| Y = (yij), i = 1, …, n, j = 1, …, q | n × q response matrix | PLS regression |
| X1, …, Xp | Uncentered predictor variables (random variables) | PLS regression |
| Y1, …, Yq | Uncentered response variables (random variables) | PLS regression |
| (x′i, y′i), i = 1, …, n | Uncentered sample | PLS regression |
| (xi, yi), i = 1, …, n | Centered sample | PLS regression |
| wj = (w1j, …, wpj)^T | Weight vector defining the j-th latent component | PLS regression |
| tj = (t1j, …, tnj)^T | j-th latent component | PLS regression |
| T = [t1, …, tc] | n × c matrix of latent components | PLS regression |
| W = [w1, …, wc] | p × c matrix of weights | PLS regression |
| Tj, j = 1, …, c | (Uncentered) random variable corresponding to tj | PLS regression |
| P | p × c matrix of X-loadings | PLS regression |
| Q | q × c matrix of Y-loadings | PLS regression |
| E | n × p error matrix | PLS regression |
| F | n × q error matrix | PLS regression |
| B | p × q matrix of regression coefficients | PLS regression |