Skip to Main Content

Article Navigation

Journal Article

Pathway-structured predictive modeling for multi-level drug response in multiple myeloma

Author Notes

Abstract

Motivation

Molecular analyses suggest that myeloma is composed of distinct sub-types that have different molecular pathologies and various response rates to certain treatments. Drug responses in multiple myeloma (MM) are usually recorded as a multi-level ordinal outcome. One of the goals of drug response studies is to predict which response category any patients belong to with high probability based on their clinical and molecular features. However, as most of genes have small effects, gene-based models may provide limited predictive accuracy. In that case, methods for predicting multi-level ordinal drug responses by incorporating biological pathways are desired but have not been developed yet.

Results

We propose a pathway-structured method for predicting multi-level ordinal responses using a two-stage approach. We first develop hierarchical ordinal logistic models and an efficient quasi-Newton algorithm for jointly analyzing numerous correlated variables. Our two-stage approach first obtains the linear predictor (called the pathway score) for each pathway by fitting all predictors within each pathway using the hierarchical ordinal logistic approach, and then combines the pathway scores as new predictors to build a predictive model. We applied the proposed method to two publicly available datasets for predicting multi-level ordinal drug responses in MM using large-scale gene expression data and pathway information. Our results show that our approach not only significantly improved the predictive performance compared with the corresponding gene-based model but also allowed us to identify biologically relevant pathways.

Availability and implementation

The proposed approach has been implemented in our R package BhGLM, which is freely available from the public GitHub repository https://github.com/abbyyan3/BhGLM.

1 Introduction

Multiple myeloma (MM) is a malignant plasma cell disorder with approximately 30 770 new cases are expected to be diagnosed and 12 770 deaths are expected to occur in the United States in 2018 (American Cancer, 2018; Kyle and RajKumar, 2008; Terragna et al., 2016). One of the major advances in the treatment regimen of MM patients has been the introduction of novel agents, such as the proteasome inhibitors (e.g. bortezomib) and immunomodulatory drugs (e.g. thalidomide), leading to the significant improvement in both overall survival and progression-free survival in MM in the last decade (Fonseca et al., 2017; Hofman et al., 2017). However, the heterogeneity exists in the patients’ response to those new treatments, and molecular features responsible for the variability in response remain undefined (Kumar et al., 2008; Malek et al., 2017; Rajkumar, 2011). Molecular analyses suggest that myeloma is composed of distinct sub-types that have different molecular pathologies and various response rates to certain treatments (Mulligan et al., 2007). Previous studies found that gene expression profiling is associated with the drug response variability of bortezomib, resulting in various disease prognoses (Mulligan et al., 2007; Terragna et al., 2016). However, most of the discovered genes have small effects, and thus contribute a trivial portion to complex drug response (Geeleher et al., 2014). It urges more efforts to develop predictive models using multiple molecular markers of MM patients to provide more precise prognostic and predictive classifiers for a specific therapy (Kumar et al., 2008).

Drug responses in MM are usually recorded as a multi-level ordinal outcome (Mulligan et al., 2007; Terragna et al., 2016). According to European Group for Bone Marrow Transplantation criteria, drug responses in MM were classified as achieving complete response (CR), partial response (PR), minimal response (MR), no change (NC) and progressive disease (PD; Blade et al., 1998). One of the goals of drug response studies is to predict which response category any patients belong to with high probability based on their clinical and molecular features. However, previous analyses, as in Mulligan et al. (2007) and Terragna et al. (2016), combined the five-level ordinal drug responses to form a binary outcome and only predicted two response categories. Such analyses fail to use full data information and can be arbitrary in selecting the recode strategy (Warner, 2008). To avoid these problems and obtain more informative and accurate prediction, therefore, it is desirable to directly analyze the multi-level drug response, rather than a simplified binary outcome.

Predictive modeling with molecular profiling requires statistical methods to properly handle at least thousands of molecular predictors. Since most of the genes have small effects on drug responses, gene-based models may provide limited predictive accuracy. One possible way to improve drug response prediction is to incorporate valuable biological information. It has been noted that the genetic nature of cancer is pathway-based, that is, oncogenes can be grouped into pathways based on biological functions such as cell survival, proliferation and metastatic dissemination (Barillot, 2012; Huang et al., 2014). Therefore, incorporating pathway information into predictive modeling could allow us to combine weak signals from a number of genes within each pathway and thus increase the power in prediction and prognosis.

Various pathway-based models for predicting censored survival outcomes have been proposed and widely used (Abraham et al., 2010; Eng et al., 2013; Huang et al., 2014; Lee et al., 2008; Reyal et al., 2008; Teschendorff et al., 2010). Especially, we have proposed an efficient two-stage approach to incorporate pathway information into the prognostic models using large-scale gene expression data for cancer survival prediction, which can be directly applicable in MM survival prediction (Zhang et al., 2017). However, methods for predicting multi-level ordinal drug responses by incorporating biological pathways have not been developed yet.

In this article, we propose a pathway-structured method for predicting multi-level ordinal responses by extending our two-stage approach for cancer survival prediction. We first develop hierarchical ordinal logistic models and an efficient quasi-Newton algorithm for jointly analyzing numerous correlated variables. Our two-stage approach first obtains the linear predictor (called the pathway score) for each pathway by fitting all predictors within each pathway using the proposed hierarchical ordinal logistic approach, and then combines the pathway scores as new predictors to build a predictive model. We applied the proposed method to two publicly available datasets for predicting multi-level ordinal drug responses in MM using large-scale gene expression data and pathway information (Mulligan et al., 2007; Terragna et al., 2016). Our results show that our two-stage approach not only significantly improved the predictive performance compared with the corresponding gene-based model but also identified biologically relevant pathways.

2 Materials and methods

2.1 Hierarchical ordinal logistic models

2.1.1 The model

Let y_i be the ordinal outcome for the ith individual and x_ij the gene expression value for the ith individual and jth gene, where i = 1, ···, n and j = 1, ···, J. The property of ordinal outcomes is that there exists a clear ordering of the response categories, but no underlying interval scale between them. Thus, it does not make sense to treat ordinal responses as numeric values. For notational convenience, we code the ordinal outcome as the integers 1, 2, ···, K, with K being the number of categories. The commonly used method for analyzing the ordinal outcome is the ordinal logistic regression(Gelman and Hill, 2007; Venables and Ripley, 2002), which can be expressed as:

\Pr (y_{i} = k) = {\begin{cases} 1 - {logit}^{- 1} (X_{i} β - c_{1}) if k = 1 \\ {logit}^{- 1} (X_{i} β - c_{k - 1}) - {logit}^{- 1} (X_{i} β - c_{k}) if 1 < k < K \\ {logit}^{- 1} (X_{i} β - c_{K - 1}) if k = K \end{cases}

(1)

where the vector X_i = (x_i₁, · ··, x_iJ) includes the expression measures of the J genes, β = (β₁, · ··, β_J)^T is a vector of the effects and the parameters c_k, called cut-points or thresholds, are constrained to increase,

c_{1} < \dots < c_{K - 1}

⁠, because the probabilities defined in Equation (1) are non-negative.

The ordinal logistic regressions are usually fitted by the maximum likelihood procedure as implemented by the function polr in the R core package MASS (Venables and Ripley, 2002). However, the classical ordinal regressions are not appropriate for jointly analyzing a large number of and/or highly-correlated predictor variables, due to the problems of non-identifiability and overfitting. One of the approaches to overcoming the problems is hierarchical modeling, that is, the parameters in the model are themselves modeled (i.e. given a prior distribution; Gelman et al., 2014; Gelman and Hill, 2007). Hierarchical modeling constraints the coefficients to lie in reasonable ranges, which allows the model to be reliably fitted and to identify important predictors. We employ the commonly used Cauchy prior distribution on the coefficients in the ordinal logistic regression:

p (β_{j}) = Caucy (0, s) = \frac{1}{π s} \frac{1}{1 + β_{j}^{2} / s^{2}}

(2)

The scale parameter s controls the amount of shrinkage in the coefficient estimate; smaller s induces stronger shrinkage on small coefficients and forces more coefficients towards zero. Cauchy priors are long-tailed, thus performing less shrinkage for large coefficients and leading to robust inferences (Gelman et al., 2008). We will discuss the choice of the scale shortly. For the cut-point parameters c_k, we use a uniform prior.

2.1.2 The algorithm

We fit the hierarchical ordinal models by finding the posterior modes of the parameters, i.e. estimating the parameters by maximizing the posterior density. The logarithm of the posterior density can be expressed as:

log p (β, c | y, X) \propto \sum_{i = 1}^{n} log [\Pr (y_{i} | X_{i} β, c)] + \sum_{j = 1}^{J} log p (β_{j})

(3)

where the first term is the log-likelihood defined by the ordinal logistic model Equation (1), and the second term includes the logarithms of Cauchy prior densities for the coefficients. If the prior on the coefficients is uniform (i.e. the classical ordinal regression), the posterior mode corresponds to the maximum likelihood estimate of the parameters.

We use the quasi-Newton algorithm, called BFGS, to estimate the posterior mode (⁠ $\hat{β}, \hat{c}$ ⁠) by iteratively maximizing the log posterior density. The quasi-Newton algorithm is an extension of the Newton–Raphson algorithm, which requires the vector of derivatives and matrix of second derivatives of the log posterior density. The BFGS algorithm forms an approximation of matrix of second derivatives using only gradient information, thus avoiding costly computation and storage. We develop the quasi-Newton algorithm by adding the derivatives and second derivatives of the log Cauchy priors to those of the log posterior density. The derivatives and second derivatives of the log Cauchy priors cane easily calculated explicitly. We implement our algorithm by altering the commonly used function polr in the R package MASS, which fits classical ordinal logistic regressions using the quasi-Newton algorithm.

The quasi-Newton algorithm also can return covariance matrix $Cov (\hat{β}, \hat{c})$ ⁠. Thus, we can test the hypothesis H₀: β_j = 0 by using the statistic $U_{j} {/ \hat{β}}_{j} / \sqrt{Var ({\hat{β}}_{j})}$ ⁠, which approximately follows the standard normal distribution. The proposed quasi-Newton algorithm is stable and fast, usually taking seconds for fitting a hierarchical ordinal logistic model with hundreds of covariates.

2.1.3 Model assessment

After fitting a hierarchical ordinal logistic model, we obtain the estimate (⁠ $\hat{β}, \hat{c}$ ⁠) and then can estimate the probabilities: $p_{i k} = Pr (y_{i} = k | X_{i} \hat{β}, \hat{c}), i = 1, \dots, n; k = 1, \dots, K$ ⁠. Denote y_ik = I(y_i = k) as the binary indicator response for the k-th category. With the estimated probabilities p_ik, we can evaluate the fitted model using the following measures:

Deviance: $d = - 2 \sum_{i = 1}^{n} log p_{i k}$ ⁠. Deviance measures the overall quality of a fitted model.
AUC (area under the ROC curve). We can calculate AUC for the k-th category using y_ik and p_ik, i = 1, ···, n, as usual. Then the AUC for all the categories is defined as $\frac{1}{K} \sum_{k = 1}^{K} {AUC}_{k}$ ⁠.
MSE (mean squared error). MSE is defined as $MSE = \frac{1}{K} \sum_{k = 1}^{K} [\frac{1}{n} \sum_{i = 1}^{n} (y_{i k} - p_{i k})^{2}]$ ⁠.
Mis-classification. The mis-classification is defined as: $MIS = \frac{1}{K} \sum_{k = 1}^{K} [\frac{1}{n} \sum_{i = 1}^{n} I (| y_{i k} - p_{i k} | > 0.5)]$ ⁠, where $I (| y_{i k} - p_{i k} | > 0.5) = 1$ if $| y_{i k} - p_{i k} | > 0.5$ ⁠, and $I (| y_{i k} - p_{i k} | > 0.5) = 0$ if $| y_{i k} - p_{i k} | \leq 0.5$ ⁠.

To evaluate the predictive performance of the model, we use a cross-validation procedure to calculate the values of the above measures (called the cross-validated measures; Hastie et al., 2015; Tibshirani and Efron, 2002). For an H-fold cross-validation, we randomly split the data into H subsets of roughly the same size. Using the (H – 1) subsets excluding the h-th subset, we obtain the estimate $({\hat{β}}^{(- h)}, {\hat{c}}^{(- h)})$ and then calculate the probabilities, $p_{i k} = Pr (y_{i} = k | X_{i} {\hat{β}}^{(- h)}, {\hat{c}}^{(- h)}), k = 1, \dots, K$ for all individuals of the h-th subset. Cycling through H parts, we obtain the probabilities p_ik for all individuals and then can calculate the measures defined above. To get more stable results, we can run H-fold cross-validation multiple times and use the average of the measure over the repeats to assess the predictive performance. We also can use leave-one-out cross-validation (LOOCV; i.e. H = n) to obtain unique result.

2.1.4 Selecting an optimal scale value

The performance of the hierarchical ordinal model can depend on the scale parameter in Cauchy prior. We fit a sequence of models with different scales covering reasonable values, from which we can choose an optimal one based on the cross-validated deviance (i.e. the lowest deviance) as described above.

2.2 Two-stage approach for incorporating pathway information

In principle, we can directly fit all available genes to build a predictive model with the above hierarchical modeling approach. However, it can be more efficient to use a two-stage approach that incorporates pathway information (Zhang et al., 2017). We propose the two-stage approach for predicting multi-level ordinal outcomes based on the hierarchical ordinal logistic models. Suppose that genes are assigned into G pathways, $G_{g} : g = 1, \dots, G,$ with the g-th pathway G_g containing J_g genes, and denote the vector of predictors in the g-th pathway by X^g. Overlapping is common in pathways analysis, that is, a gene could belong to multiple pathways. The two-stage approach can easily deal with overlapping.

In the first stage, we separately analyze each pathway by fitting all the predictors X^g within each pathway using the hierarchical ordinal logistic regression with an optimal scale. We can obtain the estimate of the linear predictor (called the pathway score),

η_{i}^{g} = X_{i}^{g} {\hat{β}}^{g}

⁠, for each individual at each pathway. In the second stage, we can use the pathway scores as predictors to build a predictive model with new independent data. With a single dataset, however, directly using the pathway scores to build a predictive model in the second stage can result in overfitting. To prevent overfitting, we estimate the cross-validated pathway scores using the LOOCV. To estimate the cross-validated pathway score of the i-th individual at the g-th pathway, we first estimate the coefficients in the g-th pathway using all the other (n – 1) individuals (i.e. excluding the i-th individual). Denote the estimate of the coefficients by

{\hat{β}}^{g (- i)}

⁠, the cross-validated score is then expressed as:

η_{(CV, i)}^{g} = X_{i}^{g} {\hat{β}}^{g (- i)}

(4)

In the second stage, we fit a hierarchical ordinal regression with an optimal scale using the cross-validated pathway scores as predictors:

\Pr (y_{i} = k) = {\begin{cases} 1 - {logit}^{- 1} (η_{(CV, i)} α - c_{1}) if k = 1 \\ {logit}^{- 1} (η_{(CV, i)} α - c_{k - 1}) - {logit}^{- 1} (η_{(CV, i)} α - c_{k}) if 1 < k < K \\ {logit}^{- 1} (η_{(CV, i)} α - c_{K - 1}) if k = K \end{cases}

(5)

where

η_{(CV, i)} = (η_{(CV, i)}^{1}, \dots, η_{(CV, i)}^{G})

is the vector of the cross-validated pathway scores,

α = {(α_{1}, \dots, α_{G})}^{T}

is the vector of the pathway effects, K is the number of categories and c_k are the cut-points as defined in Model (1). The pathways with the cross-validated AUC lower than 0.5 provide no predictive information. Thus, we can exclude these pathways from the second-stage modeling.

We also use the quasi-Newton algorithm to fit the above second-stage hierarchical ordinal regression. The quasi-Newton algorithm can calculate the P-value for testing the hypothesis H₀: α_g = 0. Thus, this hierarchical ordinal model can be used to not only predict the multi-level response but also identify important pathways. The predictive performance of the second-stage hierarchical ordinal model is evaluated using the cross-validation procedure as described earlier.

3 Computer software

We have implemented the proposed approach in our R package BhGLM, which is freely available from the public GitHub repository https://github.com/abbyyan3/BhGLM. The function bpolr sets up and fits the proposed hierarchical ordinal logistic models, predict.bh calculates the measures for evaluating the fitted model and cv.bh performs the cross-validation to assess the prediction accuracy. BhGLM also includes functions to numerically and graphically summarize the fitted model. The function bpolr is an alteration of polr in the R package MASS for fitting classical ordinal logistic regressions using the quasi-Newton algorithm.

4 Real data applications

4.1 Data summary and pathway analysis

We applied the two-stage approach to two publicly available datasets for predicting multi-level ordinal drug responses in MM. Both datasets were acquired from GEO under accession number [GEO: GSE9782] and [GEO: GSE68871]. These two datasets were generated from two independent clinical trials originally published in Mulligan et al. (2007) and Terragna et al. (2016). Mulligan et al. (2007) recruited patients with relapsed myeloma enrolled in phase 2 and phase 3 clinical trials of bortezomib. There were 169 samples with a total of 22 283 gene expression probes. Terragna et al. (2016) recruited 118 new MM patients with the induction therapy of VTD and measured a total of 54 677 gene expression probes with the Affymetrix Human Genome U133 Plus 2.0 Array. A gene represents a covariate. In our analyses, we standardized each gene expression covariate to have a mean of 0 and a SD of 1. It is important to standardize the predictors to a common and interpretable scale in hierarchical modeling (Gelman and Hill, 2007; Gelman et al., 2008).

To construct the pathways, we used genome annotation tools, KEGG (Kanehisa and Goto, 2000), to map genes to pathways. We mapped all the probes to KEGG pathways using the Bioinformatics tool DAVID (Huang et al., 2009; Huang da et al., 2009). For Mulligan et al. (2007), 7692 genes were mapped to 170 pathways. For Terragna et al. (2016), 12 036 genes were mapped to 146 pathways. The data summary and the pathways of the two studies are presented in Table 1. We used the genes that were mapped into pathways to build predictive models for multi-level ordinal outcomes.

Table 1.

Data summary and pathways of the two studies used in the analyses

Study	Mulligan et al. (2007)	Terragna et al. (2016)
Treatment	Bortezomib	VTD
Patients population	Relapsed MM	New-diagnosis
Number of samples	169	118
Number of genes	22 283	54 677
Number of genes in pathways	7692	12 036
Number of pathways	170	146

Study	Mulligan et al. (2007)	Terragna et al. (2016)
Treatment	Bortezomib	VTD
Patients population	Relapsed MM	New-diagnosis
Number of samples	169	118
Number of genes	22 283	54 677
Number of genes in pathways	7692	12 036
Number of pathways	170	146

Table 1.

Data summary and pathways of the two studies used in the analyses

Study	Mulligan et al. (2007)	Terragna et al. (2016)
Treatment	Bortezomib	VTD
Patients population	Relapsed MM	New-diagnosis
Number of samples	169	118
Number of genes	22 283	54 677
Number of genes in pathways	7692	12 036
Number of pathways	170	146

Study	Mulligan et al. (2007)	Terragna et al. (2016)
Treatment	Bortezomib	VTD
Patients population	Relapsed MM	New-diagnosis
Number of samples	169	118
Number of genes	22 283	54 677
Number of genes in pathways	7692	12 036
Number of pathways	170	146

4.2 Definitions of multi-level ordinal drug responses

According to European Group for Bone Marrow Transplantation criteria (Blade et al., 1998), patients were classified as achieving CR, PR, MR, NC and PD in Mulligan et al. (2007). In Terragna et al. (2016), patients’ drug responses were also classified as five categories: CR, near complete response (nCR), very good partial response (VGPR), PR and stable disease (SD). The five-level ordinal drug responses and their proportions in these two studies are summarized in Table 2.

Table 2.

The five-level drug ordinal outcomes, their numbers and proportions in the two datasets used in the analyses

	Response level	Number of patients (Proportion)
Mulligan et al. (2007)
Five-level outcome	PD	13 (7.70%)
	NC	60 (35.50%)
	MR	12 (7.10%)
	PR	43 (25.44%)
	CR	41 (24.26%)
Terragna et al. (2016)
Five-level outcome	SD	7 (5.93%)
	PR	42 (35.59%)
	VGPR	40 (33.90%)
	nCR	14 (11.87%)
	CR	15 (12.71%)

	Response level	Number of patients (Proportion)
Mulligan et al. (2007)
Five-level outcome	PD	13 (7.70%)
	NC	60 (35.50%)
	MR	12 (7.10%)
	PR	43 (25.44%)
	CR	41 (24.26%)
Terragna et al. (2016)
Five-level outcome	SD	7 (5.93%)
	PR	42 (35.59%)
	VGPR	40 (33.90%)
	nCR	14 (11.87%)
	CR	15 (12.71%)

Table 2.

The five-level drug ordinal outcomes, their numbers and proportions in the two datasets used in the analyses

	Response level	Number of patients (Proportion)
Mulligan et al. (2007)
Five-level outcome	PD	13 (7.70%)
	NC	60 (35.50%)
	MR	12 (7.10%)
	PR	43 (25.44%)
	CR	41 (24.26%)
Terragna et al. (2016)
Five-level outcome	SD	7 (5.93%)
	PR	42 (35.59%)
	VGPR	40 (33.90%)
	nCR	14 (11.87%)
	CR	15 (12.71%)

	Response level	Number of patients (Proportion)
Mulligan et al. (2007)
Five-level outcome	PD	13 (7.70%)
	NC	60 (35.50%)
	MR	12 (7.10%)
	PR	43 (25.44%)
	CR	41 (24.26%)
Terragna et al. (2016)
Five-level outcome	SD	7 (5.93%)
	PR	42 (35.59%)
	VGPR	40 (33.90%)
	nCR	14 (11.87%)
	CR	15 (12.71%)

We mainly analyzed the original five-level ordinal drug responses. However, it can be seen that some categories have low frequencies, for example, PD and MR in in Mulligan et al. (2007), SD, nCR and CR in Terragna et al. (2016). To avoid low frequencies in some levels and to compare with the analysis of five-level responses, we combined the five-level drug response to construct a new three-level drug response and analyzed the new three-level drug response. We kept CR as one level, the worst two response groups as one level and the middle two levels as the third level. In Mulligan et al. (2007), we combined PD and NC as a new level, and PR and MR as another new level. Thus, the new three-level ordinal outcome consisted of 73 patients having a response as PD or NC, 55 patients having a response as MR or PR and 41 patients having CR. In Terragna et al. (2016), we combined SD and PR as a new level, and VGPR and nCR as another new level. Thus, the new three-level ordinal outcome consisted of 49 patients having a response as SD or PR, 54 patients having a response as VGPR or nCR and 15 patients having CR.

4.3 Building pathway-structured predictive model with two-stage approach

We used the proposed two-stage approach to build pathway-structured predictive models using gene expression covariates for the five- and three-level ordinal drug outcomes. For our real data analysis, it took an average of 1–1.5 min to generate LOOCV pathway score for each pathway. In total, it took 2.99 and 3.59 h in a desktop for the two datasets, respectively.

In the first stage, we calculated the LOOCV pathway score for each pathway and each patient by fitting all the genes in that pathway using a hierarchical ordinal logistic model. The procedure was then repeated for all the pathways. In the second stage, we used pathways with cross-validated AUC greater than 0.5 obtained in the first stage to build a predictive model. We found that nearly 65% of the pathways had the cross-validated AUC lower than (or very close to) 0.5. The second-stage predictive model was built with the LOOCV pathway scores as new predictors with the hierarchical ordinal model approach. A 10-fold cross-validation with 10 replicates was carried out to validate the predictive performance of the final predictive model.

The cross-validated deviance, AUC, MSE and mis-classification for the two-stage approach are presented in Table 3. For the original five-level ordinal drug response outcome in Mulligan et al. (2007), the cross-validated deviance of the final predictive model was 472.600 (± 4.483), AUC was 0.585 (± 0.013), MSE was 0.142 (± 0.001) and mis-classification was 0.193 (± 0.004). For the combined three-level ordinal drug response outcome in Mulligan et al. (2007), the cross-validated deviance of the final predictive model was 335.801 (± 3.879), AUC was 0.652 (± 0.009), MSE was 0.199 (± 0.002) and mis-classification was 0.287 (± 0.006).

Table 3.

Measures of predictive performance for the two-stage and gene-based model approaches from 10-fold cross-validation with 10 replicates

		Deviance	AUC	MSE	Mis-classification
Mulligan et al. (2007)
Five-level ordinal outcome	Two-stage approach	472.600 (4.483)	0.585 (0.013)	0.142 (0.001)	0.193 (0.004)
Five-level ordinal outcome	Gene-based model	540.712 (8.214)	0.584 (0.014)	0.153 (0.002)	0.216 (0.006)
Three-level ordinal outcome	Two-stage approach	335.801 (3.879)	0.652 (0.009)	0.199 (0.002)	0.287 (0.006)
Three-level ordinal outcome	Gene-based model	433.340 (8.695)	0.593 (0.011)	0.244 (0.004)	0.391 (0.007)
Terragna et al. (2016)
Five-level ordinal outcome	Two-stage approach	264.827 (6.176)	0.799 (0.012)	0.117 (0.003)	0.166 (0.011)
Five-level ordinal outcome	Gene-based model	402.730 (8.526)	0.670 (0.012)	0.153 (0.002)	0.225 (0.008)
Three-level ordinal outcome	Two-stage approach	163.405 (7.289)	0.834 (0.011)	0.134 (0.006)	0.186 (0.012)
Three-level ordinal outcome	Gene-based model	262.964 (6.490)	0.631 (0.016)	0.211 (0.005)	0.314 (0.010)

		Deviance	AUC	MSE	Mis-classification
Mulligan et al. (2007)
Five-level ordinal outcome	Two-stage approach	472.600 (4.483)	0.585 (0.013)	0.142 (0.001)	0.193 (0.004)
Five-level ordinal outcome	Gene-based model	540.712 (8.214)	0.584 (0.014)	0.153 (0.002)	0.216 (0.006)
Three-level ordinal outcome	Two-stage approach	335.801 (3.879)	0.652 (0.009)	0.199 (0.002)	0.287 (0.006)
Three-level ordinal outcome	Gene-based model	433.340 (8.695)	0.593 (0.011)	0.244 (0.004)	0.391 (0.007)
Terragna et al. (2016)
Five-level ordinal outcome	Two-stage approach	264.827 (6.176)	0.799 (0.012)	0.117 (0.003)	0.166 (0.011)
Five-level ordinal outcome	Gene-based model	402.730 (8.526)	0.670 (0.012)	0.153 (0.002)	0.225 (0.008)
Three-level ordinal outcome	Two-stage approach	163.405 (7.289)	0.834 (0.011)	0.134 (0.006)	0.186 (0.012)
Three-level ordinal outcome	Gene-based model	262.964 (6.490)	0.631 (0.016)	0.211 (0.005)	0.314 (0.010)

Note: The values in parentheses are the standard errors over the replicates.

Table 3.

Measures of predictive performance for the two-stage and gene-based model approaches from 10-fold cross-validation with 10 replicates

		Deviance	AUC	MSE	Mis-classification
Mulligan et al. (2007)
Five-level ordinal outcome	Two-stage approach	472.600 (4.483)	0.585 (0.013)	0.142 (0.001)	0.193 (0.004)
Five-level ordinal outcome	Gene-based model	540.712 (8.214)	0.584 (0.014)	0.153 (0.002)	0.216 (0.006)
Three-level ordinal outcome	Two-stage approach	335.801 (3.879)	0.652 (0.009)	0.199 (0.002)	0.287 (0.006)
Three-level ordinal outcome	Gene-based model	433.340 (8.695)	0.593 (0.011)	0.244 (0.004)	0.391 (0.007)
Terragna et al. (2016)
Five-level ordinal outcome	Two-stage approach	264.827 (6.176)	0.799 (0.012)	0.117 (0.003)	0.166 (0.011)
Five-level ordinal outcome	Gene-based model	402.730 (8.526)	0.670 (0.012)	0.153 (0.002)	0.225 (0.008)
Three-level ordinal outcome	Two-stage approach	163.405 (7.289)	0.834 (0.011)	0.134 (0.006)	0.186 (0.012)
Three-level ordinal outcome	Gene-based model	262.964 (6.490)	0.631 (0.016)	0.211 (0.005)	0.314 (0.010)

		Deviance	AUC	MSE	Mis-classification
Mulligan et al. (2007)
Five-level ordinal outcome	Two-stage approach	472.600 (4.483)	0.585 (0.013)	0.142 (0.001)	0.193 (0.004)
Five-level ordinal outcome	Gene-based model	540.712 (8.214)	0.584 (0.014)	0.153 (0.002)	0.216 (0.006)
Three-level ordinal outcome	Two-stage approach	335.801 (3.879)	0.652 (0.009)	0.199 (0.002)	0.287 (0.006)
Three-level ordinal outcome	Gene-based model	433.340 (8.695)	0.593 (0.011)	0.244 (0.004)	0.391 (0.007)
Terragna et al. (2016)
Five-level ordinal outcome	Two-stage approach	264.827 (6.176)	0.799 (0.012)	0.117 (0.003)	0.166 (0.011)
Five-level ordinal outcome	Gene-based model	402.730 (8.526)	0.670 (0.012)	0.153 (0.002)	0.225 (0.008)
Three-level ordinal outcome	Two-stage approach	163.405 (7.289)	0.834 (0.011)	0.134 (0.006)	0.186 (0.012)
Three-level ordinal outcome	Gene-based model	262.964 (6.490)	0.631 (0.016)	0.211 (0.005)	0.314 (0.010)

Note: The values in parentheses are the standard errors over the replicates.

For the original five-level ordinal drug response outcome in Terragna et al. (2016), the cross-validated deviance of the final predictive model was 264.827 (± 6.176), AUC was 0.799 (± 0.012), MSE was 0.117 (± 0.003) and mis-classification was 0.166 (± 0.011). For the combined three-level ordinal drug response outcome in Terragna et al. (2016), the cross-validated deviance was 163.405 (± 7.289), AUC was 0.834 (± 0.011), MSE was 0.134 (± 0.006) and mis-classification was 0.186 (± 0.012).

4.4 Pathway-structured predictive model superior to gene-based model in prediction performance

We compared the two-stage approach with the gene-based model analysis that ignores the pathways information. For the gene-based model analysis, we simultaneously fit all the genes that were used in the two-stage approach using the hierarchical ordinal logistic model. We also used 10-fold cross-validation with 10 repeats to validate the predictive performance of the gene-based model. The cross-validated deviance, AUC, MSE and mis-classification for the gene-based model approach are presented in Table 3. We can see that the two-stage approach provided much lower deviance, MSE and mis-classification rate, and higher AUC than the gene-based models and thus significantly outperformed the gene-based models for both datasets.

4.5 Identified significant pathways and their biological relevance

Our two-stage approach is capable of identifying associated relevant pathways. Figure 1 shows the estimated effects and P-values of the identified significant pathways with the two-stage approach for the five-level ordinal outcome and gene expression data from Terragna et al. (2016). Four associated pathways have been identified with the significance level of 0.05. Among them, ErbB signaling pathway has been found to be frequently overexpressed by carcinoma cells (Lu and Kang, 2010; Normanno et al., 2006). Resistance to EGFR inhibitors (ErbB family of receptors) could be induced by frequent mutations within the EGFR pathway in MM (Chen et al., 2015). Aggarwal et al. (Aggarwal et al., 2006) discussed the role that chemokines and chemokine receptors play in the pathogenesis of MM and concluded that addition of chemokine antagonists to current treatment regimens for myeloma could result in better therapeutic responses. Podar et al. (Podar and Anderson, 2007) found that VEGF inhibitors could improve patient outcome in MM. On the other hand, five associated pathways have been identified with the significance level of 0.05 with the two-stage approach for the three-level ordinal outcome in Terragna et al. (Terragna et al., 2016). They are MAPK signaling pathway, Chemokine signaling pathway, Epithelial cell signaling pathway in Helicobacter pylori infection, influenza A pathway and Herpes simplex infection pathway, among which only one pathway is overlapped with the results from the five-level ordinal outcome.

Fig. 1.

The second-stage predictive model for the five-level ordinal outcome and gene expression data from Terragna et al. (Terragna et al., 2016): estimated values of pathway effects (points), 95% confidence intervals (short lines) and P-values (right side). Only significant pathways with P-value lower than 0.05 are labeled

Open in new tab Download slide

The second-stage predictive model for the five-level ordinal outcome and gene expression data from Terragna et al. (Terragna et al., 2016): estimated values of pathway effects (points), 95% confidence intervals (short lines) and P-values (right side). Only significant pathways with P-value lower than 0.05 are labeled

5 Discussion

We present a two-stage approach to incorporate the functional structure of pathways to predict response to drugs in cancer treatment. It is considered highly important to predict drug response for cancer patients, since the response to treatments, such as chemotherapeutic agents, are variable with potentially lethal side effects (Geeleher et al., 2014; Jiang and Wang, 2010). Our proposed method builds a predictive model based on an informative pathway-based data matrix, which is reduced from a high-dimensional large-scale gene matrix. There are two remarkable features in reducing the large-scale molecular matrix to a predictable pathway-based matrix: (i) it incorporates the correlation with drug response in calculating risk scores for each pathway; (ii) LOOCV was used to calculate the risk score, which not only prevents overfitting but also gives an unbiased summary of the contribution from different pathways to build the final pathway-structured prediction model.

An issue that rises in predicting cancer drug response is that the multi-level ordinal drug response will commonly be combined into a simplified binary variable. As we know, one major goal in drug response prediction is to determine the probability that a patient belongs to a certain response category based on their clinical and molecular features. Thus, the strategy to reduce the number of categories is considered to not only risk the loss of information in the data but also cannot provide informative accurate prediction (Warner, 2008). To address this issue, we employed a hierarchical ordinal regression model in our two-stage approach. It allows the direct modeling of the original multi-level ordinal drug response.

We propose to use Cauchy priors on coefficients in the hierarchical ordinal logistic models. Cauchy distributions have a peak at zero and long tails, thus performing weak shrinkage for large coefficients and strong shrinkage for irrelevant coefficients (Gelman et al. 2008, 2014), and also allow for easy and stable computation with the quasi-Newton algorithm. Some other priors can be used in the hierarchical ordinal logistic models. Possible choices may be the normal and double-exponential priors, which lead to the popular penalization approaches, ridge and lasso regressions, respectively (Hastie et al., 2015). The normal prior equally shrinks large and small coefficients and thus is not appropriate for identifying important predictors in high-dimensional settings. A remarkable feature of the double-exponential prior is that coefficients can be estimated exactly to zero if the model is fitted by the coordinate decent algorithm (Friedman et al., 2010; Simon et al., 2011). However, the coordinate decent algorithm has not been developed for ordinal logistic models. With the quasi-Newton algorithm, the double-exponential prior has no particular advantage.

We applied our two-stage approach in two publicly available datasets, which assessed responses to bortezomib in relapsed MM patients (Mulligan et al., 2007) and VTD in newly diagnosed MM patients (Terragna et al., 2016). Unlike the original papers which combined the five-level ordinal outcome as a binary outcome, we analyzed the datasets by using the original ordinal drug responses without reducing the number of categories for prediction. Our analysis results show that the predictive performance from VTD in new MM patients (Terragna et al., 2016) can be much improved than treating relapsed MM patients with bortezomib alone (Mulligan et al., 2007). To avoid low frequencies in several levels of ordinal five-level outcomes, we re-defined the five-level drug response as a three-level ordinal drug response in both datasets. For either original or re-defined outcome, the pathway-structured predictive models perform consistently better than the gene-based models using the hierarchical ordinal logistic regression model, in terms of much lower deviance, MSE, mis-classification rate and higher AUC, in both datasets. It is also noteworthy that AUC increased and deviance decreased with a trade-off in increased MSE and mis-classification rate in the analysis results using reduced three-level ordinal drug response, when compared with the results from the original outcome. It implies that combining ordinal outcome could result in higher MSE and mis-classification rate. Thus, it infers that combining ordinal outcome could lead to inaccurate informative prediction and mis-leading interpretation with possible arbitrary choices of recoding. These findings could also be supported by our identified associated pathways in VTD treated new MM patients (Terragna et al., 2016). Four associated pathways have been identified with the significance level of 0.05 using the five-level drug response in our two-stage approach, including ErbB signaling pathway, chemokines signaling pathway, VEGF signaling pathway and long-term potentiation pathway. Among them, the first three pathways have been widely investigated for their significant roles played in the pathogenesis of MM. On the other hand, there are five associated pathways identified with the significance level of 0.05 using the two-stage approach for the re-defined three-level ordinal outcome in VTD treated new MM patients (Terragna et al., 2016). They are MAPK signaling pathway, Chemokine signaling pathway, Epithelial cell signaling pathway in Helicobacter pylori infection, influenza A pathway and Herpes simplex infection pathway. There is only one pathway overlapped with the results from the five-level ordinal outcome. The differences between two outcomes also show that arbitrary recoding of ordinal response could be mis-leading in biological interpretation, thus severe incorrect information in the clinical application.

We note that only gene expression data have been used to build the pathway-structured model in this study. Our two-stage approach is directly applicable to include other clinical and molecular factors which may contribute to the predictive classifier of drug response in MM. Due to the limitations of the public datasets available, we did not incorporate those in real data analysis. We consider this matter as a subject for further work. It is also necessary to perform further research to assess the relevance of certain biomarkers in those associated pathways. Relevant biomarkers may provide extraordinary opportunities to elucidate the mechanisms of therapies in MM and to develop new diagnostics and therapeutic targets.

Funding

This work was supported in part by research grants from USA National Institutes of Health (R03-DE025646), National Natural Science Foundation of China (81673448, 81670191), Natural Science Foundation of Jiangsu Province China (BK20161218, BK20161223) and The Applied Basic Research Programs of Suzhou City (SYS201546). A Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.

Conflict of Interest: none declared.

References

Abraham

G.

et al. (

2010

)

Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context

.

BMC Bioinformatics

,

11

,

277.

Aggarwal

R.

et al. (

2006

)

Chemokines in multiple myeloma

.

Exper. Hematol

.,

34

,

1289

–

1295

.

American Cancer

Society

(

2018

)

Cancer Facts and Figures 2018

.

American Cancer Society

,

Atlanta

.

OpenURL Placeholder Text

Barillot

E.

(

2012

)

Computational Systems Biology of Cancer

.

Chapman & Hall/CRC mathematical and computational biology series

,

CRC Press, Boca Raton, FL

.

Blade

J.

et al. (

1998

)

Criteria for evaluating disease response and progression in patients with multiple myeloma treated by high-dose therapy and haemopoietic stem cell transplantation. Myeloma Subcommittee of the EBMT

.

European Group for Blood and Marrow Transplant

.

Br. J. Haematol

.,

102

,

1115

–

1123

.

Chen

Y.

et al. (

2015

)

Multiple myeloma acquires resistance to EGFR inhibitor via induction of pentose phosphate pathway

.

Sci. Rep

.,

5

,

9925.

Eng

K.H.

et al. (

2013

)

Pathway index models for construction of patient-specific risk profiles

.

Stat. Med

.,

32

,

1524

–

1535

.

Fonseca

R.

et al. (

2017

)

Trends in overall survival and costs of multiple myeloma, 2000-2014

.

Leukemia

,

31

,

1915

–

1921

.

Friedman

J.

et al. (

2010

)

Regularization paths for generalized linear models via coordinate descent

.

J. Stat. Softw

.,

33

,

1

–

22

.

Geeleher

P.

et al. (

2014

)

Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines

.

Genome Biol

.,

15

,

R47.

Gelman

A.

,

Hill

J.

(

2007

)

Data Analysis Using Regression and Multilevel/Hierarchical Models

.

Cambridge University Press

,

New York

.

OpenURL Placeholder Text

Gelman

A.

et al. (

2008

)

A weakly informative default prior distribution for logistic and other regression models

.

Ann. Appl. Stat

.,

2

,

1360

–

1383

.

OpenURL Placeholder Text

Gelman

A.

et al. (

2014

)

Bayesian Data Analysis.

Chapman and Hall/CRC Press

,

New York

.

OpenURL Placeholder Text

Hastie

T.

et al. (

2015

)

Statistical Learning with Sparsity–The Lasso and Generalization

.

CRC Press

,

New York

.

Hofman

I.J.F.

et al. (

2017

)

RPL5 on 1p22.1 is recurrently deleted in multiple myeloma and its expression is linked to bortezomib response

.

Leukemia

,

31

,

1706

–

1714

.

Huang

D.W.

et al. (

2009

)

Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists

.

Nucleic Acids Res

.,

37

,

1

–

13

.

Huang da

W.

et al. (

2009

)

Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources

.

Nat. Protoc

.,

4

,

44

–

57

.

Huang

S.

et al. (

2014

)

A novel model to combine clinical and pathway-based transcriptomic information for the prognosis prediction of breast cancer

.

PLoS Comput. Biol

.,

10

,

e1003851.

Jiang

Y.

,

Wang

M.

(

2010

)

Personalized medicine in oncology: tailoring the right drug to the right patient

.

Biomarkers Med

.,

4

,

523

–

533

.

Kanehisa

M.

,

Goto

S.

(

2000

)

KEGG: kyoto encyclopedia of genes and genomes

.

Nucleic Acids Res

.,

28

,

27

–

30

.

Kumar

S.K.

et al. (

2008

)

Improved survival in multiple myeloma and the impact of novel therapies

.

Blood

,

111

,

2516

–

2520

.

Kyle

R.A.

,

Rajkumar

S.V.

(

2008

)

Multiple myeloma

.

Blood

,

111

,

2962

–

2972

.

Lee

E.

et al. (

2008

)

Inferring pathway activity toward precise disease classification

.

PLoS Comput. Biol

.,

4

,

e1000217.

Lu

X.

,

Kang

Y.

(

2010

)

Epidermal growth factor signalling and bone metastasis

.

Br. J. Cancer

,

102

,

457

–

461

.

Malek

E.

et al. (

2017

)

Pharmacogenomics and chemical library screens reveal a novel SCFSKP2 inhibitor that overcomes Bortezomib resistance in multiple myeloma

.

Leukemia

,

31

,

645

–

653

.

Mulligan

G.

et al. (

2007

)

Gene expression profiling and correlation with outcome in clinical trials of the proteasome inhibitor bortezomib

.

Blood

,

109

,

3177

–

3188

.

Normanno

N.

et al. (

2006

)

Epidermal growth factor receptor (EGFR) signaling in cancer

.

Gene

,

366

,

2

–

16

.

Podar

K.

,

Anderson

K.C.

(

2007

)

Inhibition of VEGF signaling pathways in multiple myeloma and other malignancies

.

Cell Cycle

,

6

,

538

–

542

.

Rajkumar

S.V.

(

2011

)

Treatment of multiple myeloma

.

Nat. Rev. Clin. Oncol

.,

8

,

479

–

491

.

Reyal

F.

et al. (

2008

)

A comprehensive analysis of prognostic signatures reveals the high predictive capacity of the proliferation, immune response and RNA splicing modules in breast cancer

.

Breast Cancer Res

,

10

,

R93.

Simon

N.

et al. (

2011

)

Regularization paths for Cox’s proportional hazards model via coordinate descent

.

J. Stat. Softw

.,

39

,

1

–

13

.

Terragna

C.

et al. (

2016

)

The genetic and genomic background of multiple myeloma patients achieving complete response after induction therapy with bortezomib, thalidomide and dexamethasone (VTD)

.

Oncotarget

,

7

,

9666

–

9679

.

Teschendorff

A.E.

et al. (

2010

)

Improved prognostic classification of breast cancer defined by antagonistic activation patterns of immune response pathway modules

.

BMC Cancer

,

10

,

604.

Tibshirani

R.J.

,

Efron

B.

(

2002

)

Pre-validation and inference in microarrays

.

Stat. Appl. Genet. Mol. Biol

.,

1

,

Article1.

Venables

W.N.

,

Ripley

B.D.

(

2002

)

Modern Applied Statistics with S

.

Springer-Verlag

,

New York

.

Warner

P.

(

2008

)

Ordinal logistic regression

.

J. Fam. Plann. Reprod. Health Care

,

34

,

169

–

170

.

Zhang

X.

et al. (

2017

)

Pathway-structured predictive model for cancer survival prediction: a two-stage approach

.

Genetics

,

205

,

89

–

100

.

Author notes

The authors wish it to be known that, in their opinion, the Xinyan Zhang and Bingzong Li authors should be regarded as Joint First Authors.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Associate Editor:

Download all slides

Views

1,511

Altmetric

Total Views 1,511

1,047 Pageviews

464 PDF Downloads

Since 5/1/2018

Month:	Total Views:
May 2018	13
June 2018	57
July 2018	17
August 2018	20
September 2018	25
October 2018	170
November 2018	142
December 2018	21
January 2019	21
February 2019	16
March 2019	18
April 2019	18
May 2019	4
June 2019	14
July 2019	14
August 2019	3
September 2019	11
October 2019	12
November 2019	21
December 2019	10
January 2020	26
February 2020	11
March 2020	16
April 2020	21
May 2020	13
June 2020	36
July 2020	28
August 2020	7
September 2020	9
October 2020	14
November 2020	6
December 2020	9
February 2021	6
March 2021	16
April 2021	8
May 2021	8
June 2021	20
July 2021	11
August 2021	9
September 2021	6
October 2021	17
November 2021	13
December 2021	7
January 2022	7
February 2022	11
March 2022	17
April 2022	12
May 2022	16
June 2022	25
July 2022	22
August 2022	15
September 2022	50
October 2022	69
November 2022	32
December 2022	29
January 2023	29
February 2023	6
March 2023	15
April 2023	48
May 2023	8
June 2023	3
July 2023	15
August 2023	21
September 2023	6
October 2023	9
November 2023	9
December 2023	23
January 2024	16
February 2024	21
March 2024	16
April 2024	17