-
PDF
- Split View
-
Views
-
Cite
Cite
Benoît Liquet, Pierre Lafaye de Micheaux, Boris P. Hejblum, Rodolphe Thiébaut, Group and sparse group partial least square approaches applied in genomics context, Bioinformatics, Volume 32, Issue 1, January 2016, Pages 35–42, https://doi.org/10.1093/bioinformatics/btv535
- Share Icon Share
Abstract
Motivation: The association between two blocks of ‘omics’ data brings challenging issues in computational biology due to their size and complexity. Here, we focus on a class of multivariate statistical methods called partial least square (PLS). Sparse version of PLS (sPLS) operates integration of two datasets while simultaneously selecting the contributing variables. However, these methods do not take into account the important structural or group effects due to the relationship between markers among biological pathways. Hence, considering the predefined groups of markers (e.g. genesets), this could improve the relevance and the efficacy of the PLS approach.
Results: We propose two PLS extensions called group PLS (gPLS) and sparse gPLS (sgPLS). Our algorithm enables to study the relationship between two different types of omics data (e.g. SNP and gene expression) or between an omics dataset and multivariate phenotypes (e.g. cytokine secretion). We demonstrate the good performance of gPLS and sgPLS compared with the sPLS in the context of grouped data. Then, these methods are compared through an HIV therapeutic vaccine trial. Our approaches provide parsimonious models to reveal the relationship between gene abundance and the immunological response to the vaccine.
Availability and implementation: The approach is implemented in a comprehensive package called sgPLS available on the CRAN.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Recent advances in high-throughput ‘omics’ technologies enable quantitative measurements of expression or abundance of biological molecules of a whole biological system. Various popular ‘omics’ platforms in systems biology include transcriptomics, proteomics, cytomics and metabolomics. The integration of multi-layer information is required to fully unravel the complexities of a biological system, as each functional level is hypothesized to be related to each other (Jayawardana etal., 2015; Kitano, 2002). Furthermore, multi-layer information is increasingly available such as in standard clinical trials. As an example, the evaluation of vaccines in phase I/II trials incorporates various measurements of the cell counts (tens of population of interest), of the cell functionality by many ways including the production of cytokines (intra and extracellular) and of the gene expression (Palermo etal., 2011).
The integration of omics data is a challenging task. First, the high dimensionality of the data, i.e. the large number of measured biological entities (tens of thousands) makes it very difficult to obtain a good overview or understanding of the system under study. The noisy characteristics of such high-throughput data require a filtering process to be able to identify a clear signal. Second, because of experimental or financial constraints, the small number of samples or patients (typically < 50) makes the statistical inference difficult and argue for using the maximum amount of available information. Third, the integration of heterogeneous data also represents an analytical and numerical challenge to try to find common patterns in data from different origins.
In recent years, several statistical integrative approaches have been proposed in the literature to combine two blocks of omics data, often in an unsupervised framework. These approaches aim at selecting correlated biological entities from two datasets (Chun and Keleş, 2010;,Lê Cao etal., 2008, 2009; Parkhomenko etal., 2009; Waaijenborg etal., 2008; Witten etal., 2009) or more (Löfstedt et al., 2014; Tenenhaus and Tenenhaus, 2011). This abundant literature clearly illustrates that the integrative analysis of two datasets poses significant statistical challenges to deal with the high dimensionality of the data. In particular, sparse partial least squares (sPLSs), using a L1 penalty, has been developed for that purpose. With sPLS, it has been demonstrated that the integrative analysis of large scale omics datasets could generate new knowledge not accessible by the analysis of a single data type alone (Lê Cao etal., 2008, 2009). Moreover, the biological relevance of the approach has been demonstrated in recent studies (Morine etal., 2011; Rose etal., 2011).
However, group structures often existing within such data have not yet been accounted for in these analyses. For example, genes within the same pathway have similar functions and act together in regulating a biological system. These genes can add up to have a larger effect and therefore can be detected as a group [i.e. at a pathway or gene set level (Tyekucheva etal., 2011)]. This has been increasingly used thank to geneset enrichment analysis approaches (Subramanian etal., 2005).
Considering a group of features instead of individual features has been found to be effective for biomarker identification (Meier etal., 2008; Puig etal., 2009; Simon and Tibshirani, 2012; Yuan and Lin, 2006). Yuan and Lin (2006) proposed group lasso for group variables selection. Meier etal. (2008) extended it to logistic regression. Puig etal. (2009) and Simon et al. (2013) modified group lasso to solve the non-orthonormal matrices problem. Although group lasso penalty can increase the power for variable selection, it requires a strong group-sparsity (Huang and Zhang, 2010) and cannot yield sparsity within a group. Ma etal. (2007) proposed a supervised group lasso which selects both significant gene clusters and significant genes within clusters for logistic binary classification and Cox survival analysis. Simon et al. (2013) proposed a sparse group lasso penalty by combining an L1 penalty with group lasso to yield sparsity at both the group and individual feature level. Zhou (2010) applied it to genomic feature identification. Garcia etal. (2014) developed a sparse group-subgroup Lasso to accommodate selecting important groups, subgroups and individual predictors. In a regression context with a multivariate response variable, Li etal. (2015) have recently proposed a multivariate sparse group lasso.
Also some work has been reported to incorporate ‘group effect’ into a conventional canonical correlation analysis (CCA) model. Chen etal. (2013) studied structure-based CCA and proposed tree-based and network-based CCA (Chen etal., 2013). Chen and Liu (2012) incorporated group effect into an association study of nutrient intake with human gut microbiome (Chen and Liu, 2012). Both papers show an improvement when incorporating group effect; however, a priori knowledge of group structure is needed and only the group effect of one type of data is discussed. More recently, Lin etal. (2013) developed a more general group sparse CCA method, which have been illustrated on genomics datasets (human gliomas data and NCI60 data). Witten etal. (2009) proposed a general penalized matrix decomposition (PMD) approach, which include sparse principal components and CCA. Sparsity have been realized by introducing different penalties forms such as L1 penalty or fused lasso penalty to get smooth results in the context of ordered features. Based on generalized least square matrix decomposition, Allen etal. (2014) develop fast computational algorithms for generalized principal component analysis (PCA) and sparse PCA. In the same idea, a regularized PLS (RPLS) approach is proposed by Allen etal. (2013) to take into account the correlations between adjacent variables. However, none of the PMD and RPLS approaches have introduced group and sparse group lasso penalty.
Here, we develop in a more general framework a group PLS (gPLS) method and a sparse gPLS (sgPLS) method (see also Löfstedt etal. 2014). Both methods focus on sub-matrices decomposition taking into account the group structures. They could be used in ‘regression’ mode or in ‘canonical mode’. The gPLS model aims at performing selection at group level while sgPLS enables selection at both group and single feature levels. Both irrelevant groups of features and individual features in the remaining groups will be simultaneously discarded with sgPLS.
Our article is organized as follows. The model and algorithm for group and sgPLS are described in Section 2 after introducing the main steps of sPLS. We also present our extension in a context of PLS discriminant analysis. The performances of our approaches are compared with sPLS via a simulation study in Section 3. This section also contains an illustration of our method with an HIV vaccine study. The results are compared with the one obtained by applying the multivariate sparse group lasso recently proposed by Li etal. (2015).
2 Methods
2.1 Notations
Let and be two data matrices containing n observations (rows) of p predictors (gene expression) and q variables (cytokine secretion), respectively. The soft thresholding function is , where . The Frobenius norm is denoted , while the Euclidean vector norm is and the L1 vector norm is .
2.2 PLS and sPLS for integrative analysis
2.2.1 Partial least square
2.2.2 Sparse PLS
2.3 Group PLS and sparse group PLS
2.3.1 Group PLS
For k in : apply (7)norm
For l in : apply (8)norm
2.3.2 Sparse group PLS
For k in : if (10) is true else apply (15)norm
For l in : apply (16) is true else apply (17)norm
2.3.3 Remark
Calibration of the different tuning parameters is discussed in the Supplementary Material.
2.3.4 Extension to discriminant analysis of one dataset
PLSs has often been used for discrimination problems (Nguyen and Rocke 2002) by recoding the qualitative response as a dummy block matrix indicating the group of each sample (g being the number of groups). Barker and Rayens (2003) give some theoretical justification for this approach. One can also directly apply PLS regression on the data as if was a continuous matrix (from now on called PLS-DA). A sparse version has been proposed by Lê Cao etal. (2011a, b) using the Lagrangian form of PLS-DA to include a L1 constraint. Our algorithm for the gPLS-DA is obtained by replacing with and by using only a group penalty on the loading related to the matrix ( in Equation 3). Algorithm for sgPLS-DA is also obtained by replacing with and by using only penalties on the loading related to the matrix ( and in Equation 3). The tuning parameters ( and ) are calibrated sequentially by evaluating the classification error rate using leave-one-out or k-folds cross-validation.
3 Results and discussion
A simulation study is performed to demonstrate the good performance of gPLS and sgPLS when compared with sPLS. Then, the three methods are applied on an HIV vaccine evaluation study.
3.1 Simulation study
3.2 Recovering the signal
The first simulation corresponds to the case when the q = 10 variables from dataset have not been divided into groups. The sample size is n = 100 and we consider a model with only one latent variable . Dataset is divided into GX = 20 groups of 20 variables. Among them, only 4 groups each containing 15 true variables and 5 noise variables are associated to the response variables . We set the p-vector to have 15 1s, 30 -1s and 15 1.5s, the other components being set to 0. All 15 non-zero coefficients are assigned randomly into one group along with the remaining 5 zero coefficients corresponding to noise variables (see top left panel of Supplementary Fig. S1). In the top four panels of Supplementary Figure S1, the standard deviation σ has been set to 0.5. A standard deviation is set for the other panels. Supplementary Figure S1 displays the results of the loading vectors recovered by sPLS, gPLS and sgPLS. Notice that sPLS and sgPLS recover satisfactorily the true when the noise is small (). However, when , sPLS selects more noise variables than the true and misses out some true variables while sgPLS still works well. For both situations, gPLS recovers all groups but gives false positives within the groups. We arrive at the same conclusion when the q = 500 variables of are divided into GY = 25 groups (see Supplementary Figs S2–S5). In this situation, we consider a model with two latent variables and . The vector is chosen in the same way as previously described for . The two columns of are q-vectors containing 15 -1s, 15 -1.5s and 30 1s and the rest are 0 s. For a small standard deviation (Supplementary Figs S2–S3), all methods perform well to recover the loading vectors and related to the first component and the loading vectors and related to the second component. However, gPLS gives false positives within the groups. For a high standard deviation (Supplementary Figs S4–S5), sPLS selects more noise variables, while sgPLS still performs well by selecting only relevant variables into the true groups.
3.3 Effect of noise
We investigate the performance of our methods with respect to noise. This simulation corresponds to the situation when the q = 10 variables from dataset have not been divided into groups and when one latent variable has been generated. We use a sequence of 10 equispaced standard deviation values σ, from 1.5 to 3. Supplementary Figure S7 presents the average of TPR and TD over 50 replications for each method. For all noise levels, we found that gPLS and sgPLS manage to recover the true groups. Thus, TD for gPLS is equal to 20 as each of the 4 true groups contains 5 noise variables (FP). However, for high values of noise, the sparsity introduced within the group by sgPLS misses out some true variables. Note that sPLS also misses out true variables when the noise is increased and gets the highest TD due to the high number of false positives selected. We present in Supplementary Figure S6 the average, over 50 replications, of the signal recovered (original loading ) by the three methods when σ = 3. This figure highlights the number of false positives selected by sPLS.
3.4 Effect of the number of true variables in the group
We investigate the performance of our methods when the number of true variables within each group is varied. We are still in the framework of one latent variable generated and only the p = 400 variables from dataset have been divided into groups. Among them, 40 variables are associated to the latent variable. Dataset contains GX = 40 groups, each with the same group size of 10. We set the sample size n = 100 and the standard deviation of the noise σ = 2. The number of groups containing true variables is varied in while each group contains 10, 8, 5, 4, 2 and 1 true variables, respectively. Supplementary Figure S8 presents the average results over 50 replications. When true variables are distributed in 4 and 5 groups, both gPLS and sgPLS give a much higher value of TPR and a lower value of TD than sPLS. When the true variables are more sparsely distributed into different groups (number of groups increased), gPLS still gives high TPR but TD is increased. Note that sgPLS is the best method with high TPR and low TD values.
3.5 Effect of sample size
We investigate the performance of our methods when the sample size is increased by steps of 50 from n = 50 to n = 500. Dataset contains GX = 40 groups, each with the same group size of 10. Among the p = 400 variables in dataset , 60 are true variables, which are distributed evenly into 6 groups. This clearly advantages gPLS. Supplementary Figure S9 presents the average results over 50 replications of TPR and TD with respect to different sample sizes. It can be seen that gPLS gives better results than the two other methods by finding the true groups whatever the sample size. For both sPLS and sgPLS, TPR increases while TD decreases when sample size increases. The TPR of sPLS is most of the time lower than that of sgPLS. Moreover, the TD of sPLS is higher than both those of gPLS and sgPLS.
3.6 Effect of the dimension p
We investigate the performance of our methods when the dimension p of is increasing ().Dataset contains groups, each with the same group size of 50. Among the GX groups, groups contains 35 true variables. This simulation corresponds to the situation when the q = 10 variables from dataset have not been divided into groups and when one latent variable has been generated. We set different sample size and 100. The standard deviation of the noise is . Supplementary Table S4 presents the average results over 50 replications of TPR, false-positive rate (FPR) and TD for the different values of p and for different sample sizes. In this simulation setting gPLS manages to find the true groups whatever the dimension p and the sample size. As 15 noise variables are included in each true group, gPLS method gets higher FPR and TD than sgPLS and sPLS. Almost every time, sgPLS outperforms sPLS by getting higher TPR and lower TD and FPR.
3.7 Application for an HIV vaccine evaluation study
3.7.1 Description of the study
The method has been applied to an HIV vaccine trial: the DALIA trial (Lévy etal., 2014). In this trial, 19 HIV-infected patients have been included for an evaluation of the safety and the immunogenicity of a dendritic-cell-based vaccine. The vaccine was injected on weeks 0, 4, 8 and 12, while patients received an antiretroviral therapy. An interruption of the antiretrovirals was performed at week 24 and the treatment was resumed if the CD4 + T cell count dropped below 350 cells/L. After vaccination, a deep evaluation of the immune response was performed at week 16 while repeated measurements of the main immune markers and gene expression were performed every 4 weeks until the end of the trial at 48 weeks. The analyses of the immune markers showed a strong immunological response especially among the CD4 + T cell populations with an increase of cells producing various cytokines. After the antiretroviral treatment interruption, there was a strong viral replication that was heterogeneous among the study population and the immune response appeared to be associated with the observed maximum value of viral load during the rebound (Lévy etal., 2014). One of the question that follows these results was whether the immune response and its impact on the viral dynamics could be explained and predicted with the observation of the change of gene expression during vaccination. To answer this question, we first analyzed the repeated measurements of gene abundance performed by microarrays (Illumina HumanHT-12 v4). Because of the sparsity of the measurements, only geneset analyses allow to detect significant changes over time of group of genes. Using previously defined group of genes, so called modules (Chaussabel etal., 2008), we reported a significant change of gene expression among 69 modules over time before antiretroviral treatment interruption. Then, we asked how the gene abundance of the 5399 genes (p) from these 69 modules as measured at week 16 correlated with immune markers measured at the same time point. The immune markers were either direct measurement of cytokine concentrations on supernatants (IL21, IL2, IL13, IFNg) or a combination of markers as measured by multiplex and intracellular staining (Luminex score, TH1 score, CD4 polyfunctionnality). Seven (q) different immune markers have been used in the current application. Measurements and calculations are detailed elsewhere (Lévy etal., 2014). Of note, the modules as defined in Chaussabel etal. (2008) were not overlapping: each gene contributed to only one module. To use the present approach with overlapping groups of genes (such as in Gene Ontology), an extension in the spirit of Jacob etal. (2009) would be required.
3.7.2 Results
The selection process of the genes according to the mean square error of prediction criteria for each method is shown in Supplementary Figures S11–S13. Supplementary Table S5 shows the cumulative percentage of variation of the responses variables explained by the components according to the method. With three components, more than 80% of the variance could be explained. The sgPLS methods selected slightly more genes than the sPLS (respectively, 487 and 420 genes selected), but sgPLS selected fewer modules than the sPLS (respectively, 21 and 64 groups of genes selected). Of note, all the 21 groups of genes selected by the sgPLS were included in those selected by the sPLS method. sgPLS selected slightly more modules than gPLS (4 more, 14/21 in common). However, gPLS led to more genes selected than sgPLS (944). Supplementary Figure S10 displays Venn diagrams of the selected sets by the three methods, both at the gene and module levels. Therefore, in this context of hierarchical data, sgPLS ends up being more parsimonious than sPLS in term of modules and than gPLS in term of genes by taking the most predictive genes within a selected module (see Supplementary Table S6). Such results are consistent with those of the simulation study. The selection of the most predictive genes inside a module improves the prediction capacity and does not avoid any interpretation of the biological significance of the remaining genes inside a module based on the biological knowledge available for this module.
Interestingly, the modules commonly selected by the three approaches were the most biologically relevant ones, such as the modules associated to inflammation (M3.2, M4.13, M4.2, M4.6, M7.1) and cellular responses (M3.6 cytotoxic/NK cell, 4.15 on T cells). A significant number of additional modules have been selected by sPLS but not by the other methods. Although, some are biologically sound (e.g. M3.1 Erythrocytes), many were not annotated or are unrelated to the immune system making the hypothesis of false signals likely. Regarding the modules differentially selected by sgPLS and gPLS, three modules were selected by gPLS but not by sgPLS (M4.11 Plasma cells and undetermined modules 4.8 and 7.35), whereas seven modules were selected by sgPLS and not by gPLS (M3.5 and M4.7 Cell cycle, M4.1 T cell, M5.1 and M5.7 inflammation, M6.7 and M5.2 undetermined). Clearly, the selection by sgPLS sounds more biologically relevant as M4.1 is known to be associated to the other T-cell module M4.15 that was selected by both methods (see www.biir.net/public_wikis/module_annotation/V2_Trial_8_Modules). This is the same for the inflammatory modules M5.1 and M5.7 that are known to be related. In regards of the module M4.11 (plasma cells), we have already notice that its selection was not robust in bootstrap analyses performed with sPLS (data not shown). Therefore, in this application, the sgPLS approach led to a parsimonious selection of modules and genes that sound very relevant biologically and is in agreement with the simulation results.
An illustration of the results is shown through the correlation matrix in Supplementary Figure S14 according to the genes selected in common by the three methods. Genes and modules negatively or positively associated to the immune response appeared very clearly. As expected, inflammatory modules were negatively correlated to the immune response whereas modules related to the cellular immune response (M3.6, M4.15) were positively correlated to the immune response.
Stability of the module selection has been assessed for sPLS, gPLS and sgPLS (see Supplementary Fig. S15) in the spirit of Bach (2008), Meinshausen and Bühlmann (2010) and Lê Cao etal. (2011a, b). It highlights the insights gained from incorporating the grouping structure into the analysis, comforting the above biological conclusions. Also, it reveals the instability of the selection of the module 4.11 by sPLS and gPLS, module that was not selected by sgPLS. Furthermore, we compared the proposed novel approach with the multivariate lasso implemented in glmnet R package (Friedman etal., 2010) and a multivariate (sparse) group lasso proposed by Li etal. (2015). The two alternative approaches selected less modules and genes but the modules selected were most often the same across the different methods (see Supplementary Materials, Section S1.7).
Funding
This research was supported by the NSERC of Canada (to P.L.d.M.). The second author also thanks the GENES.
Conflict of Interest: none declared.
References
Author notes
Associate Editor: Janet Kelso