- Split View
-
Views
-
Cite
Cite
Florence H. Yong, Lu Tian, Sheng Yu, Tianxi Cai, L. J. Wei, Optimal stratification in outcome prediction using baseline information, Biometrika, Volume 103, Issue 4, December 2016, Pages 817–828, https://doi.org/10.1093/biomet/asw049
- Share Icon Share
A common practice in predictive medicine is to use current study data to construct a stratification procedure, which groups subjects according to baseline information and forms stratum-specific prevention or intervention strategies. A desirable stratification scheme would not only have small intra-stratum variation but also have a clinically meaningful discriminatory capability. We show how to obtain optimal stratification rules with such desirable properties from fitting a set of regression models relating the outcome to baseline covariates and creating scoring systems for predicting potential outcomes. We propose that all available optimal stratifications be evaluated with an independent dataset to select a final stratification. Lastly, we obtain inferential results for this selected stratification scheme with a holdout dataset. When only one study of moderate size is available, we combine the first two steps via crossvalidation. Extensive simulation studies are used to compare the proposed stratification strategy with alternatives. We illustrate the new proposal using an AIDS clinical trial for binary outcomes and a cardiovascular clinical study for censored event time outcomes.
1. Introduction
One important aspect of stratified medicine development is the ability to predict the response probability for individual subjects. To construct a prediction procedure for a future subject’s outcome via his or her baseline information, at the first step it is common practice to fit a parametric or semiparametric regression model in order to relate a subject’s outcome to his or her covariates. If the model is a reasonable approximation to the true one, the resulting predicted value for an individual would be close to his or her outcome value. Such predicted values create a scoring system for all future subjects, which can have systematic bias when the regression model is misspecified. To eliminate the bias, one may further calibrate the scoring system nonparametrically (Tian et al., 2014). However, the calibrated prediction for subjects with a given score can be quite unstable due to sparseness of the observations. Consequently, the resulting fine-level prediction procedure may perform poorly. Common practice is to group the scores into several strata and use the average observed outcome in each stratum to predict outcomes of new subjects classified into that stratum. However, the choice of strata is often ad hoc in practice. In this paper, we present a stratification procedure based on baseline covariates which provides a clinically meaningful grouping strategy.
In this article, we present an optimal stratification strategy incorporating model selection from a collection of candidates that satisfy certain criteria. Specifically, we show how to obtain an optimal grouping scheme for each candidate scoring system created from a regression model. For example, with the predicted response rates (1), we consider all possible discretization schemes with stratum sizes at least 10% of the study sample size and any monotonically increasing stratum-specific average response rates with an incremental value of at least |$0.2$| between consecutive strata. We then choose the final stratification to be the one that minimizes a certain overall prediction error among all such stratification schemes. Dynamic programming techniques are used to solve this optimization problem (Taha, 2003). With the data from the HIV example and model (1), our proposal results in three categories with stratum-specific average response rates of 11%, 42% and 69% and stratum sizes of 65, 343 and 129, respectively. If model (1) is appropriate, future patients classified to the first stratum may not benefit much from this rather costly three-drug combination therapy. We also describe how to evaluate competing models to select the final stratification scheme. Finally, we generalize the new proposal to handle censored event time outcomes and illustrate the procedure using data from a cardiovascular study (Braunwald et al., 2004).
2. Optimal stratification for a specific scoring system
Let |$Y$| be the outcome variable and |$V$| a vector of baseline covariates. Assume that the conditional mean |$\mu(V)= E(Y\,|\,V) $| is the parameter of interest for future prediction. To estimate |$\mu(V)$| when |$V$| is not a scalar, we generally use a working model which relates |$Y$| to |$Z$|, a vector transformation of |$V$| including components such as quadratic and interaction terms. Let |$\mu(V) =g (\beta^{ \mathrm{\scriptscriptstyle T} } Z),$| where |$g(\cdot)$| is a given smooth monotone function. Let the data consist of |$n$| independent copies |$\{(Y_i, V_i, Z_i), i=1,\ldots,n\}$| of |$(Y, V, Z)$|. An estimate |$\hat{\beta} $| for |$\beta$| can be obtained via a regularized estimation procedure such as the lasso (Tibshirani, 1996), especially when the dimension of |$Z$| is large. If the regression model is a reasonable approximation to the true one, the resulting estimator |$\hat{\mu}(V)=g(\hat{\beta}^{ \mathrm{\scriptscriptstyle T} } Z)$| would be close to |$\mu(V)$|, and a large |$\hat{\mu}(\cdot)$| indicates that the subject is expected to have a large outcome value |$Y$|. As an example, for the binary outcome |$Y$| in the HIV study discussed in |$\S\,$|1, one may use a logistic model with lasso or ridge regularization to obtain |$\hat{\mu}(\cdot)$|. The score (1) in |$\S\,$|1 gives the individual predicted response rate based on a simple additive logistic regression with four baseline covariates.
Minimizing the loss function (2) over all qualifying stratification schemes is a challenging problem. In the Supplementary Material, we show how to identify the boundary values |$\{\hat{c}_0,\ldots, \hat{c}_{\hat K} \}$| of the strata. We use (2) to evaluate the prediction error. The asymptotic optimality of the resulting stratification scheme is guaranteed by Lemma 1, whose justification is given in the Supplementary Material. This large-sample property of the optimal stratification scheme is also essential to ensure its stability under the crossvalidation setting discussed in |$\S\,$|3.
Suppose that the following hold.
(i) The regularized estimator |$\hat{\beta}$| of the working regression model converges to a constant vector |$\beta_0$| within a compact parameter space in probability.
(ii) The score estimate |$\hat{\mu}(v)$| converges uniformly to a deterministic function |$\tilde{\mu}(v)$| in probability on the support of |$V$|.
(iii) The score |$\tilde{\mu}(V)$| is a continuous random variable with bounded support and the joint density function of the continuous components of the random vector |$(Y, V^{ \mathrm{\scriptscriptstyle T} })^{ \mathrm{\scriptscriptstyle T} }$| is continuously differentiable.
(iv) The outcome |$Y$| has bounded support.
3. Selecting an optimal stratification scheme
One can obtain the optimal stratified prediction procedure for a prediction score system created by a given working regression model as presented in |$\S\,$|2. To make inferences about the resulting stratified prediction procedure, one may use an independent dataset. When only a single study is available, we may split its data into two separate parts, say, I and II. Using Part I data, we obtain the optimal stratification scheme. Then we use Part II data for inference. Moreover, if there is a collection of competing stratified score systems considered as potential candidates, we may further split the Part I data into two parts, say, Ia and Ib. The data from Part Ia are used for obtaining the optimal stratification schemes for each candidate scoring system as in |$\S\,$|2, whereas Part Ib data are used to evaluate all candidates and to select the best stratification scheme. This multi-stage procedure is similar to that proposed in Li et al. (2016).
Since Parts Ia and Ib may be small, one may use Monte Carlo crossvalidation (Xu & Liang, 2001; Yong et al., 2013) to obtain a more stable (3). Specifically, we randomly split the Part I dataset into Ia and Ib, say, |$N$| times. For the |$j$|th split, we repeat the above model building and evaluation procedure for each candidate model and obtain |$\mathcal{L}_{j}^*$| from (3). We then compute the average, |$\bar{\mathcal{L}}^* = N^{-1}\sum_{j=1}^N \mathcal{L}_j^*$|. For each candidate model, we refit the entire Part I data and denote the final realized stratification rule by |$\mathcal{M}^*$|. The pair (|$\bar{\mathcal{L}}^*, \mathcal{M}^*$|) reflects the magnitude of the estimated within-stratum variation and the complexity of each candidate model. The selection of an optimal stratification rule would be based on such pairs. With the data from Part II, we then construct confidence intervals for the stratum-specific mean values of the outcome variables for the selected final stratification scheme. Although crossvalidation may not choose the best stratification based on the Part I data, it identifies the procedure with the best average performance in constructing the stratification. Thus it is reasonable to expect that the stratification rule from the selected procedure will be highly competitive if not optimal.
We now use the data from the HIV study to illustrate our proposal. For this study, other than the four baseline variables discussed in |$\S\,$|1, there are seven additional baseline covariates and two short-term marker values at week 4, including CD4 count, denoted by |$\mbox{CD4}_{4}$|, and |$\log_{10} \mbox{RNA},$| denoted by |$\log_{10}\mbox{RNA}_4$|. The additional baseline covariates are race, injection-drug use, haemophilia, CD8 count, weight, Karnofsky performance score, and months of prior zidovudine therapy. Patients with missing covariate values account for 4% of the data. Any missing covariates are replaced by the corresponding sample averages.
We first randomly split the entire dataset of 537 patients into Part I and Part II, with sample sizes |$268$| and |$269$|. In the Monte Carlo crossvalidation, we take |$N{=}200$| and the sizes of Part Ia and Ib to be equal for each crossvalidation. For illustration, we consider four working models: three logistic regression models with lasso regularization methods and tuning parameters selected via a 20-fold crossvalidation procedure (Friedman et al., 2010), and a null model using the overall mean response proportion in Part Ia to predict future outcomes. Table 1 summarizes the composition of each model. We also present |$\mathcal{\bar{L}}^*,$| obtained by averaging the 200 |$\mathcal{L}^*_j$| values; and for the corresponding |$\mathcal{M}^*$|, we report the numbers of informative baseline covariates needed to compute the score |$\hat{\mu}(V)$| and nonzero regression coefficients in |$\hat{\beta}$| estimated based on the entire Part I data to summarize its complexity. In constructing all candidate stratification schemes, we use |$d_0 = 0.2$| and |$p_0=0.1$|.
This final selected stratification scheme |$\mathcal{M}^*$| has three strata with cut-offs |$\hat c_1 = 0.25$| and |$\hat c_2 = 0.45$|. The stratum-specific means are |$0.06$|, |$0.36$|, and |$0.62$|; and the corresponding stratum sizes are |$51$|, |$107$|, and |$110$|, based on the Part I data. These stratum-outcome-average estimates may be biased due to the extensive model building, evaluation and selection. To obtain valid inferences for this final stratification rule, we use the boundary values |$\hat c_1$| and |$\hat c_2$| to group subjects from Part II data. The resulting point estimates and |$0.95$| confidence intervals for the three stratum-average response rates are |$0.17$| (|$0.06$|, |$0.28$|), |$0.41$| (|$0.31$|, |$0.51$|) and |$0.65$| (|$0.57$|, |$0.73$|), with stratum sizes |$47, 91,$| and |$131$|; see the Supplementary Material.
Model . | Candidate independent variables . | dim|$(Z)$| . | |$\bar{\mathcal{L}}^*$| . | |$\mathcal{M}^*$| . | |
---|---|---|---|---|---|
. | . | . | . | # covariates . | |$\|\hat{\beta}\|_0$| . |
1 | age, sex, CD4 count and |$\log_{10}$|RNA at baseline and week 4 | 6 | |$0.42$| | 5 | 5 |
2 | all baseline covariates plus their first-order interaction terms | 78 | |$0.47$| | 12 | 14 |
3 | all baseline covariates and CD4 and |$\log_{10}$|RNA at week 4 | 105 | |$0.41$| | 13 | 20 |
plus their first-order interaction terms | |||||
4 | none | 0 | |$0.48$| | 0 | 0 |
Model . | Candidate independent variables . | dim|$(Z)$| . | |$\bar{\mathcal{L}}^*$| . | |$\mathcal{M}^*$| . | |
---|---|---|---|---|---|
. | . | . | . | # covariates . | |$\|\hat{\beta}\|_0$| . |
1 | age, sex, CD4 count and |$\log_{10}$|RNA at baseline and week 4 | 6 | |$0.42$| | 5 | 5 |
2 | all baseline covariates plus their first-order interaction terms | 78 | |$0.47$| | 12 | 14 |
3 | all baseline covariates and CD4 and |$\log_{10}$|RNA at week 4 | 105 | |$0.41$| | 13 | 20 |
plus their first-order interaction terms | |||||
4 | none | 0 | |$0.48$| | 0 | 0 |
Model . | Candidate independent variables . | dim|$(Z)$| . | |$\bar{\mathcal{L}}^*$| . | |$\mathcal{M}^*$| . | |
---|---|---|---|---|---|
. | . | . | . | # covariates . | |$\|\hat{\beta}\|_0$| . |
1 | age, sex, CD4 count and |$\log_{10}$|RNA at baseline and week 4 | 6 | |$0.42$| | 5 | 5 |
2 | all baseline covariates plus their first-order interaction terms | 78 | |$0.47$| | 12 | 14 |
3 | all baseline covariates and CD4 and |$\log_{10}$|RNA at week 4 | 105 | |$0.41$| | 13 | 20 |
plus their first-order interaction terms | |||||
4 | none | 0 | |$0.48$| | 0 | 0 |
Model . | Candidate independent variables . | dim|$(Z)$| . | |$\bar{\mathcal{L}}^*$| . | |$\mathcal{M}^*$| . | |
---|---|---|---|---|---|
. | . | . | . | # covariates . | |$\|\hat{\beta}\|_0$| . |
1 | age, sex, CD4 count and |$\log_{10}$|RNA at baseline and week 4 | 6 | |$0.42$| | 5 | 5 |
2 | all baseline covariates plus their first-order interaction terms | 78 | |$0.47$| | 12 | 14 |
3 | all baseline covariates and CD4 and |$\log_{10}$|RNA at week 4 | 105 | |$0.41$| | 13 | 20 |
plus their first-order interaction terms | |||||
4 | none | 0 | |$0.48$| | 0 | 0 |
4. Asimulation study
To compare our stratification method with alternatives, we conducted a simulation study where data were generated mimicking the AIDS Clinical Trials Group trial described in |$\S\,$|1. Throughout the simulation, we used the observed baseline RNA and week 4 RNA levels as the covariates. In the first setting we simulated the binary response according to a logistic regression model with log|$_{10}$|-transformed baseline RNA, |$\log_{10}\mbox{RNA}_0$|, and week 4 RNA, |$\log_{10}\mbox{RNA}_4,$| and their product as the independent variables. The regression coefficients of the logistic regression model were set as the maximum likelihood estimators based on the original data. In the second setting, we doubled all the regression coefficients in the logistic regression model, and adjusted the intercept term to maintain the same overall prevalence rate. In the third setting, the binary response was simulated with different probabilities in nine subgroups summarized in Table 2, where the cut-off values |$(0.56, 2.04)\times 10^5$| and |$(0.22, 3.02)\times 10^3$| are tertiles of baseline and week 4 RNA levels, respectively. The response probability was chosen to be the observed response rate of the corresponding subgroup in the AIDS Clinical Trials Group 320 trial. In the last setting, the response probabilities of the aforementioned nine subgroups were modified to increase the between-group differences, as shown in Table 2.
. | . | Setting 3 . | . | Setting 4 . | ||||
---|---|---|---|---|---|---|---|---|
. | . | Baseline RNA |$(\times 10^5)$| . | Baseline RNA |$(\times 10^5)$| . | |||||
. | . | |$[0, 0.56]$| . | |$(0.56, 2.04]$| . | |$(2.04, \infty)$| . | |$[0, 0.56]$| . | |$(0.56, 2.04]$| . | |$(2.04, \infty)$| . | |
Week 4 | |$[0, 0.22]$| | 61 | 64 | 64 | 74 | 78 | 78 | |
RNA |$(\times 10^3)$| | |$(0.22, 3.02]$| | 47 | 48 | 44 | 47 | 48 | 40 | |
|$(3.02, \infty)$| | 15 | 19 | 27 | 3 | 6 | 14 |
. | . | Setting 3 . | . | Setting 4 . | ||||
---|---|---|---|---|---|---|---|---|
. | . | Baseline RNA |$(\times 10^5)$| . | Baseline RNA |$(\times 10^5)$| . | |||||
. | . | |$[0, 0.56]$| . | |$(0.56, 2.04]$| . | |$(2.04, \infty)$| . | |$[0, 0.56]$| . | |$(0.56, 2.04]$| . | |$(2.04, \infty)$| . | |
Week 4 | |$[0, 0.22]$| | 61 | 64 | 64 | 74 | 78 | 78 | |
RNA |$(\times 10^3)$| | |$(0.22, 3.02]$| | 47 | 48 | 44 | 47 | 48 | 40 | |
|$(3.02, \infty)$| | 15 | 19 | 27 | 3 | 6 | 14 |
. | . | Setting 3 . | . | Setting 4 . | ||||
---|---|---|---|---|---|---|---|---|
. | . | Baseline RNA |$(\times 10^5)$| . | Baseline RNA |$(\times 10^5)$| . | |||||
. | . | |$[0, 0.56]$| . | |$(0.56, 2.04]$| . | |$(2.04, \infty)$| . | |$[0, 0.56]$| . | |$(0.56, 2.04]$| . | |$(2.04, \infty)$| . | |
Week 4 | |$[0, 0.22]$| | 61 | 64 | 64 | 74 | 78 | 78 | |
RNA |$(\times 10^3)$| | |$(0.22, 3.02]$| | 47 | 48 | 44 | 47 | 48 | 40 | |
|$(3.02, \infty)$| | 15 | 19 | 27 | 3 | 6 | 14 |
. | . | Setting 3 . | . | Setting 4 . | ||||
---|---|---|---|---|---|---|---|---|
. | . | Baseline RNA |$(\times 10^5)$| . | Baseline RNA |$(\times 10^5)$| . | |||||
. | . | |$[0, 0.56]$| . | |$(0.56, 2.04]$| . | |$(2.04, \infty)$| . | |$[0, 0.56]$| . | |$(0.56, 2.04]$| . | |$(2.04, \infty)$| . | |
Week 4 | |$[0, 0.22]$| | 61 | 64 | 64 | 74 | 78 | 78 | |
RNA |$(\times 10^3)$| | |$(0.22, 3.02]$| | 47 | 48 | 44 | 47 | 48 | 40 | |
|$(3.02, \infty)$| | 15 | 19 | 27 | 3 | 6 | 14 |
For each setting, the mean square errors for predicting future observations were estimated for various stratification rules and the calibrated continuous score. As references, the mean square errors were also computed for the null and true models, using the mean response rate in the training set and the true individual response rate as the risk prediction, respectively. For each stratification, the number of strata and the differences between neighbouring strata were recorded.
The results based on 1000 simulations are reported in Table 3. The mean square errors across different methods lie between those of the null and true models, as expected. Compared with other stratification methods, our method usually yields the smallest prediction error at a level comparable to that of the continuous score in all settings, in spite of the fact that the latter is asymptotically optimal. The stratification from classification and regression trees is competitive and tends to yield fewer strata. The strata constructed by the new proposal guarantee that a high proportion of between-strata differences is greater than the prespecified threshold and the violation is mild when that occurred. Not all the differences satisfy the constraint, due to the randomness of |$\bar{Y}_k - \bar{Y}_{k-1}$| in the training set.
Setting . | Stratification . | MSE . | Proportion of between-strata differences satisfying the constraint . | ||||
---|---|---|---|---|---|---|---|
. | method . | |$(\times 100)$| . | |$K=2$| . | |$K=3$| . | |$K=4$| . | |$K=5$| . | |$K=6$| . |
1 | New | |$22.1$| | 100 | |$53.0$| | |$27.4$| | ||
(|$13.1$|) | (|$81.3$|) | |$(5.6)$| | |||||
Median | |$22.5$| | 100 | |||||
Tertile | |$22.1$| | |$50.0$| | |||||
Quartile | |$22.2$| | |$0.0$| | |||||
CART | |$22.2$| | 100 | 50 | |$33.7$| | |$29.4$| | 10 | |
(|$50.4$|) | (|$39.0$|) | (|$8.7$|) | (|$1.7$|) | (|$0.2$|) | |||
Continuous | |$22.1$| | ||||||
Null model | |$24.7$| | ||||||
True model | |$21.5$| | ||||||
2 | New | |$16.6$| | 100 | |$99.8$| | |$54.2$| | |$28.1$| | |
(|$0.1$|) | (|$52.6$|) | (|$46.5$|) | (|$0.8$|) | ||||
Median | |$18.2$| | 100 | |||||
Tertile | |$17.2$| | |$100.0$| | |||||
Quartile | |$17.3$| | |$52.4$| | |||||
CART | |$16.8$| | 100 | |$98.8$| | |$50.3$| | |$46.7$| | ||
(|$9.3$|) | (|$76.4$|) | (|$12.8$|) | (|$1.5$|) | ||||
Continuous | |$21.8$| | ||||||
Null model | |$24.9$| | ||||||
True model | |$16.0$| | ||||||
3 | New | |$22.6$| | 100 | |$43.7$| | |$33.3$| | ||
(|$33.4$|) | (|$66.3$|) | (|$0.3$|) | |||||
Median | |$22.9$| | 100 | |||||
Tertile | |$22.3$| | |$47.2$| | |||||
Quartile | |$22.6$| | 0 | |||||
CART | |$22.6$| | 100 | |$45.9$| | |$22.9$| | |$16.7$| | 25 | |
(|$59.7$|) | (|$27.9$|) | (|$9.6$|) | (|$1.8$|) | (|$0.4$|) | |||
Continuous | |$22.6$| | ||||||
Null model | |$24.7$| | ||||||
True model | |$21.8$| | ||||||
4 | New | |$17.6$| | 100 | |$99.9$| | |$63.6$| | |$50.0$| | |
(|$0.1$|) | (|$75.5$|) | (|$24.1$|) | (|$0.3$|) | ||||
Median | |$19.3$| | 100 | |||||
Tertile | |$17.7$| | |$100.0$| | |||||
Quartile | |$18.4$| | |$66.7$| | |||||
CART | |$17.6$| | 100 | |$99.8$| | |$66.7$| | |$50.0$| | ||
(|$6.3$|) | (|$78.3$|) | (|$13.6$|) | (|$1.8$|) | ||||
Continuous | |$22.0$| | ||||||
Null model | |$24.6$| | ||||||
True model | |$17.0$| |
Setting . | Stratification . | MSE . | Proportion of between-strata differences satisfying the constraint . | ||||
---|---|---|---|---|---|---|---|
. | method . | |$(\times 100)$| . | |$K=2$| . | |$K=3$| . | |$K=4$| . | |$K=5$| . | |$K=6$| . |
1 | New | |$22.1$| | 100 | |$53.0$| | |$27.4$| | ||
(|$13.1$|) | (|$81.3$|) | |$(5.6)$| | |||||
Median | |$22.5$| | 100 | |||||
Tertile | |$22.1$| | |$50.0$| | |||||
Quartile | |$22.2$| | |$0.0$| | |||||
CART | |$22.2$| | 100 | 50 | |$33.7$| | |$29.4$| | 10 | |
(|$50.4$|) | (|$39.0$|) | (|$8.7$|) | (|$1.7$|) | (|$0.2$|) | |||
Continuous | |$22.1$| | ||||||
Null model | |$24.7$| | ||||||
True model | |$21.5$| | ||||||
2 | New | |$16.6$| | 100 | |$99.8$| | |$54.2$| | |$28.1$| | |
(|$0.1$|) | (|$52.6$|) | (|$46.5$|) | (|$0.8$|) | ||||
Median | |$18.2$| | 100 | |||||
Tertile | |$17.2$| | |$100.0$| | |||||
Quartile | |$17.3$| | |$52.4$| | |||||
CART | |$16.8$| | 100 | |$98.8$| | |$50.3$| | |$46.7$| | ||
(|$9.3$|) | (|$76.4$|) | (|$12.8$|) | (|$1.5$|) | ||||
Continuous | |$21.8$| | ||||||
Null model | |$24.9$| | ||||||
True model | |$16.0$| | ||||||
3 | New | |$22.6$| | 100 | |$43.7$| | |$33.3$| | ||
(|$33.4$|) | (|$66.3$|) | (|$0.3$|) | |||||
Median | |$22.9$| | 100 | |||||
Tertile | |$22.3$| | |$47.2$| | |||||
Quartile | |$22.6$| | 0 | |||||
CART | |$22.6$| | 100 | |$45.9$| | |$22.9$| | |$16.7$| | 25 | |
(|$59.7$|) | (|$27.9$|) | (|$9.6$|) | (|$1.8$|) | (|$0.4$|) | |||
Continuous | |$22.6$| | ||||||
Null model | |$24.7$| | ||||||
True model | |$21.8$| | ||||||
4 | New | |$17.6$| | 100 | |$99.9$| | |$63.6$| | |$50.0$| | |
(|$0.1$|) | (|$75.5$|) | (|$24.1$|) | (|$0.3$|) | ||||
Median | |$19.3$| | 100 | |||||
Tertile | |$17.7$| | |$100.0$| | |||||
Quartile | |$18.4$| | |$66.7$| | |||||
CART | |$17.6$| | 100 | |$99.8$| | |$66.7$| | |$50.0$| | ||
(|$6.3$|) | (|$78.3$|) | (|$13.6$|) | (|$1.8$|) | ||||
Continuous | |$22.0$| | ||||||
Null model | |$24.6$| | ||||||
True model | |$17.0$| |
New, the proposed method; Median, median-based stratification; Tertile, Tertile-based stratification; Quartile, Quartile-based stratification; CART, classification and regression tree; Continuous, calibrated continuous score.
Setting . | Stratification . | MSE . | Proportion of between-strata differences satisfying the constraint . | ||||
---|---|---|---|---|---|---|---|
. | method . | |$(\times 100)$| . | |$K=2$| . | |$K=3$| . | |$K=4$| . | |$K=5$| . | |$K=6$| . |
1 | New | |$22.1$| | 100 | |$53.0$| | |$27.4$| | ||
(|$13.1$|) | (|$81.3$|) | |$(5.6)$| | |||||
Median | |$22.5$| | 100 | |||||
Tertile | |$22.1$| | |$50.0$| | |||||
Quartile | |$22.2$| | |$0.0$| | |||||
CART | |$22.2$| | 100 | 50 | |$33.7$| | |$29.4$| | 10 | |
(|$50.4$|) | (|$39.0$|) | (|$8.7$|) | (|$1.7$|) | (|$0.2$|) | |||
Continuous | |$22.1$| | ||||||
Null model | |$24.7$| | ||||||
True model | |$21.5$| | ||||||
2 | New | |$16.6$| | 100 | |$99.8$| | |$54.2$| | |$28.1$| | |
(|$0.1$|) | (|$52.6$|) | (|$46.5$|) | (|$0.8$|) | ||||
Median | |$18.2$| | 100 | |||||
Tertile | |$17.2$| | |$100.0$| | |||||
Quartile | |$17.3$| | |$52.4$| | |||||
CART | |$16.8$| | 100 | |$98.8$| | |$50.3$| | |$46.7$| | ||
(|$9.3$|) | (|$76.4$|) | (|$12.8$|) | (|$1.5$|) | ||||
Continuous | |$21.8$| | ||||||
Null model | |$24.9$| | ||||||
True model | |$16.0$| | ||||||
3 | New | |$22.6$| | 100 | |$43.7$| | |$33.3$| | ||
(|$33.4$|) | (|$66.3$|) | (|$0.3$|) | |||||
Median | |$22.9$| | 100 | |||||
Tertile | |$22.3$| | |$47.2$| | |||||
Quartile | |$22.6$| | 0 | |||||
CART | |$22.6$| | 100 | |$45.9$| | |$22.9$| | |$16.7$| | 25 | |
(|$59.7$|) | (|$27.9$|) | (|$9.6$|) | (|$1.8$|) | (|$0.4$|) | |||
Continuous | |$22.6$| | ||||||
Null model | |$24.7$| | ||||||
True model | |$21.8$| | ||||||
4 | New | |$17.6$| | 100 | |$99.9$| | |$63.6$| | |$50.0$| | |
(|$0.1$|) | (|$75.5$|) | (|$24.1$|) | (|$0.3$|) | ||||
Median | |$19.3$| | 100 | |||||
Tertile | |$17.7$| | |$100.0$| | |||||
Quartile | |$18.4$| | |$66.7$| | |||||
CART | |$17.6$| | 100 | |$99.8$| | |$66.7$| | |$50.0$| | ||
(|$6.3$|) | (|$78.3$|) | (|$13.6$|) | (|$1.8$|) | ||||
Continuous | |$22.0$| | ||||||
Null model | |$24.6$| | ||||||
True model | |$17.0$| |
Setting . | Stratification . | MSE . | Proportion of between-strata differences satisfying the constraint . | ||||
---|---|---|---|---|---|---|---|
. | method . | |$(\times 100)$| . | |$K=2$| . | |$K=3$| . | |$K=4$| . | |$K=5$| . | |$K=6$| . |
1 | New | |$22.1$| | 100 | |$53.0$| | |$27.4$| | ||
(|$13.1$|) | (|$81.3$|) | |$(5.6)$| | |||||
Median | |$22.5$| | 100 | |||||
Tertile | |$22.1$| | |$50.0$| | |||||
Quartile | |$22.2$| | |$0.0$| | |||||
CART | |$22.2$| | 100 | 50 | |$33.7$| | |$29.4$| | 10 | |
(|$50.4$|) | (|$39.0$|) | (|$8.7$|) | (|$1.7$|) | (|$0.2$|) | |||
Continuous | |$22.1$| | ||||||
Null model | |$24.7$| | ||||||
True model | |$21.5$| | ||||||
2 | New | |$16.6$| | 100 | |$99.8$| | |$54.2$| | |$28.1$| | |
(|$0.1$|) | (|$52.6$|) | (|$46.5$|) | (|$0.8$|) | ||||
Median | |$18.2$| | 100 | |||||
Tertile | |$17.2$| | |$100.0$| | |||||
Quartile | |$17.3$| | |$52.4$| | |||||
CART | |$16.8$| | 100 | |$98.8$| | |$50.3$| | |$46.7$| | ||
(|$9.3$|) | (|$76.4$|) | (|$12.8$|) | (|$1.5$|) | ||||
Continuous | |$21.8$| | ||||||
Null model | |$24.9$| | ||||||
True model | |$16.0$| | ||||||
3 | New | |$22.6$| | 100 | |$43.7$| | |$33.3$| | ||
(|$33.4$|) | (|$66.3$|) | (|$0.3$|) | |||||
Median | |$22.9$| | 100 | |||||
Tertile | |$22.3$| | |$47.2$| | |||||
Quartile | |$22.6$| | 0 | |||||
CART | |$22.6$| | 100 | |$45.9$| | |$22.9$| | |$16.7$| | 25 | |
(|$59.7$|) | (|$27.9$|) | (|$9.6$|) | (|$1.8$|) | (|$0.4$|) | |||
Continuous | |$22.6$| | ||||||
Null model | |$24.7$| | ||||||
True model | |$21.8$| | ||||||
4 | New | |$17.6$| | 100 | |$99.9$| | |$63.6$| | |$50.0$| | |
(|$0.1$|) | (|$75.5$|) | (|$24.1$|) | (|$0.3$|) | ||||
Median | |$19.3$| | 100 | |||||
Tertile | |$17.7$| | |$100.0$| | |||||
Quartile | |$18.4$| | |$66.7$| | |||||
CART | |$17.6$| | 100 | |$99.8$| | |$66.7$| | |$50.0$| | ||
(|$6.3$|) | (|$78.3$|) | (|$13.6$|) | (|$1.8$|) | ||||
Continuous | |$22.0$| | ||||||
Null model | |$24.6$| | ||||||
True model | |$17.0$| |
New, the proposed method; Median, median-based stratification; Tertile, Tertile-based stratification; Quartile, Quartile-based stratification; CART, classification and regression tree; Continuous, calibrated continuous score.
5. Event time as the outcome variable
If the outcome |$T$| is the time to a specific event, it may be censored and its mean or median may be badly estimated. A common summary parameter is the event rate at a specific time-point |$\tau,$| but this does not include information about the event occurrence profile. The restricted mean survival time has been argued to be a clinically meaningful summary for such a distribution (Royston, 2006; Royston & Parmar, 2011; Zhao et al., 2013). Specifically, let |$Y = TI(T \le \tau) + \tau I(T > \tau)$| and |$\mu(V) = E(Y \mid V) = \int_0^{\tau} S(t \mid V)\, {\rm d}t$| as defined in |$\S\,$|2, where |$S(t \mid V) = \mbox{pr}(T > t \mid V)$|. Here, |$\mu(V)$| is the average event-free time for all subjects with covariate |$V$|, which would be followed up to time-point |$\tau$|. The definition of restricted mean survival time and subsequent results depend on the choice of |$\tau,$| which is often set to be close to the maximum study follow-up time. In practice, |$T$| may be right-censored by an independent random variable |$C$|. However, one can always observe |$(X, V, \Delta),$| where |$X = \min(T,C)$| and |$\Delta=I(T\le C)$|. Therefore, the observed data consist of |$n$| independent copies |$\{(X_i, V_i, \Delta_i), i= 1,\ldots,n\}$| of |$(X, V, \Delta)$|. When |$\Delta_i=1$| or |$X_i \ge \tau$|, |$Y_i=\min(T_i, \tau)$| is observed.
To select the best scoring model from the competing scoring systems, one can use the procedure in |$\S\,$|3 with the weighted version of (3) to evaluate the candidate stratification schemes.
6. Example with censored event time outcomes
We use the data from a cardiovascular clinical trial, Prevention of Events with Angiotensin Converting Enzyme Inhibition, to illustrate the proposal with an event time outcome variable. The trial is a double-blind, placebo-controlled study (Braunwald et al., 2004) of 8290 patients enrolled to investigate whether the addition to the conventional therapy of an angiotensin-converting-enzyme inhibitor would provide benefit with respect to, for example, the patient’s specific cardiovascular event-free survival. The inhibitor is trandolpril at a target dose of 4 mg/day. The outcome is assumed to be the time to death, nonfatal myocardial infarction or coronary revascularization, whichever occurred first. There are 2110 patients who experienced this composite event, with a median follow-up time of 54 months. A |$0.95$| confidence interval for the hazard ratio is (|$0.86$|, |$1.02$|). Since there was no statistically significant treatment effect, we combined the data from the two treatment groups. The overall observed event times in months range between |$0.1$| and |$81.5$| with quartiles |$12.8$| and |$42.4$|. If we let |$\tau=72$| months, the estimated restricted mean event time for the entire group is |$60.4$| months. This suggests that for future patients in this study population, one expects to have an average of |$60.4$| months event-free with a follow-up time of 72 months.
Based on the results of Solomon et al. (2006), we considered the following baseline covariates for prediction: the study treatment indicator, age, gender, left ventricular ejection fraction, history of myocardial infarction, history of hypertension, history of diabetes, and estimated glomerular filtration rate as a four-category discretized version represented by three indicator variables with cut-points of 45, 60, and 75. Left ventricular ejection fraction is missing for 412 patients and other covariate information is almost complete, with no more than 11 missing values each. We imputed the missing covariate values with the averages for continuous variables and the most frequently observed category for binary variables. We then randomly split the data into Parts I and II, each consisting of 4145 patients, and split Part I randomly for crossvalidation with |$N=200$|. Several candidate models are listed in Table 4. Model 2 is built upon the observation that there is potential treatment and estimated glomerular filtration rate interaction (Solomon et al., 2006).
For each regression candidate model, |$d_0 = 3$| months and |$p_0=0.05$| in the Part Ia training data. Table 4 summarizes |$\bar{\mathcal{L}}^*$| for the optimal stratification based on each regression working model and the numbers of informative baseline covariates used in computing the estimated scores and nonzero regression coefficients of |$\hat{\beta}$| for |$\mathcal{M}^*$|. Model 2 has the smallest |$\bar{\mathcal{L}}^*$| and yields three strata with |$\hat{c}_1$| = |$56.5$| and |$\hat{c}_2$| = |$60.5$| months. The range of the estimated restricted mean survival time in the Part I data is from |$51.0$| to |$63.8$| months. The corresponding estimated restricted mean survival times of the strata are |$54.5$|, |$58.7$| and |$62.3$| months. To make inferences about the prediction of this selected final stratification scheme, we apply it to the Part II data. The corresponding Kaplan–Meier curves for three strata are given in Fig. 1. Based on the restricted area under the Kaplan–Meier curves derived from 1000 bootstrap samples, the point estimates and |$0.95$| confidence intervals for the stratum-specific restricted mean survival times are |$54.3$| (|$51.0$|, |$57.2$|), |$58.9$| (|$57.7$|, |$60.0$|) and |$62.0$| (|$61.2$|, |$62.8$|) months for the three strata, with |$n=245, 1350,$| and 2550 respectively.
Model . | Candidate independent variables . | dim|$(Z)$| . | |$\bar{\mathcal{L}}^*$| . | |$\mathcal{M}^*$| . | |
---|---|---|---|---|---|
. | . | . | . | # covariates . | |$\|\hat{\beta}\|_0$| . |
1 | age, gender, left ventricular ejection fraction, history of | 10 | |$16.92$| | 6 | 7 |
myocardial infarction, history of hypertension, | |||||
history of diabetes, estimated glomerular filtration rate, | |||||
ACE inhibitor treatment | |||||
2 | variables in Model 1 plus three treatment and | 13 | |$16.90$| | 6 | 9 |
estimated glomerular filtration rate interaction terms | |||||
3 | variables in Model 1 plus their first-order interaction terms | 55 | |$16.97$| | 6 | 9 |
4 | none | 0 | |$18.65$| | 0 | 0 |
Model . | Candidate independent variables . | dim|$(Z)$| . | |$\bar{\mathcal{L}}^*$| . | |$\mathcal{M}^*$| . | |
---|---|---|---|---|---|
. | . | . | . | # covariates . | |$\|\hat{\beta}\|_0$| . |
1 | age, gender, left ventricular ejection fraction, history of | 10 | |$16.92$| | 6 | 7 |
myocardial infarction, history of hypertension, | |||||
history of diabetes, estimated glomerular filtration rate, | |||||
ACE inhibitor treatment | |||||
2 | variables in Model 1 plus three treatment and | 13 | |$16.90$| | 6 | 9 |
estimated glomerular filtration rate interaction terms | |||||
3 | variables in Model 1 plus their first-order interaction terms | 55 | |$16.97$| | 6 | 9 |
4 | none | 0 | |$18.65$| | 0 | 0 |
Model . | Candidate independent variables . | dim|$(Z)$| . | |$\bar{\mathcal{L}}^*$| . | |$\mathcal{M}^*$| . | |
---|---|---|---|---|---|
. | . | . | . | # covariates . | |$\|\hat{\beta}\|_0$| . |
1 | age, gender, left ventricular ejection fraction, history of | 10 | |$16.92$| | 6 | 7 |
myocardial infarction, history of hypertension, | |||||
history of diabetes, estimated glomerular filtration rate, | |||||
ACE inhibitor treatment | |||||
2 | variables in Model 1 plus three treatment and | 13 | |$16.90$| | 6 | 9 |
estimated glomerular filtration rate interaction terms | |||||
3 | variables in Model 1 plus their first-order interaction terms | 55 | |$16.97$| | 6 | 9 |
4 | none | 0 | |$18.65$| | 0 | 0 |
Model . | Candidate independent variables . | dim|$(Z)$| . | |$\bar{\mathcal{L}}^*$| . | |$\mathcal{M}^*$| . | |
---|---|---|---|---|---|
. | . | . | . | # covariates . | |$\|\hat{\beta}\|_0$| . |
1 | age, gender, left ventricular ejection fraction, history of | 10 | |$16.92$| | 6 | 7 |
myocardial infarction, history of hypertension, | |||||
history of diabetes, estimated glomerular filtration rate, | |||||
ACE inhibitor treatment | |||||
2 | variables in Model 1 plus three treatment and | 13 | |$16.90$| | 6 | 9 |
estimated glomerular filtration rate interaction terms | |||||
3 | variables in Model 1 plus their first-order interaction terms | 55 | |$16.97$| | 6 | 9 |
4 | none | 0 | |$18.65$| | 0 | 0 |
Since this set of results depends on the choice of |$\tau$| used in the restricted mean survival time, we also conducted extensive sensitivity analyses with |$\tau$| from 60 to 80 months. The resulting stratification scheme is quite stable when |$\tau \le 77$| months. When |$\tau\ge 78$| months, the stratification rule starts to deviate substantially, perhaps due to the presence of extreme weights at low values of the censoring survival estimator |$\hat G(\cdot)$|.
7. Discussion
A common practice in predictive medicine is to create an ordered category system to classify future subjects according to their baseline information. A desirable quantitative stratification procedure would have both a small overall prediction error and a reasonable discriminatory capability across the strata. In this article, we provide a systematic approach to constructing such a stratification rule. The user can determine important components including the loss function, the minimum stratum size and the desired clinically meaningful difference between strata. Our method uses a heuristically interpretable metric, a loss function based on the |$L_1$|-norm, for quantifying the prediction error, but it could be replaced by, for example, mean squared error, integrated Brier score and other metrics for prediction performance (Graf et al., 1999; Choodari-Oskooei et al., 2012a,b). As a possible modification, one may also introduce a weight function |$w(\cdot)$| in the modified loss |$n^{-1}\sum_k \sum_{i \in S_k} |Y_i-\bar{Y}_k|w(V_i)$| to encourage finer or coarser stratification for high- or low-risk patients. The proposed stratification requires a minimum stratum size to avoid unstable small strata. The choice of this size depends on the amount of information in the training set Part Ia, which is usually quantified by the sample size or the observed event rate. In choosing its value, one also needs to consider the objective of the analysis. For example, if one wants to identify a small subgroup of high-risk individuals, the minimum stratum size needs to be smaller than the anticipated proportion of the high-risk stratum. To enhance the discriminatory ability of the scheme, we set a minimum incremental value between two consecutive stratum-specific predicted values. The choice of this value depends on clinical inputs. For example, for the cardiovascular study, the range of the restricted mean survival time scores is from |$51.0$| to |$63.8$| months based on the Part I training data, which is relatively narrow. A choice of an incremental value of three months for illustration in |$\S\,$|6 seems appropriate.
An obvious extension of the new proposal is to construct an optimal stratification procedure for treatment selections based on data either from randomized clinical trials or from observational studies. In such a case, one needs to predict the treatment effect measured by the difference of potential clinical outcomes of the patient under different treatments rather than the outcome itself. Unfortunately, the |$L_1$| loss function utilized in this article cannot be generalized to deal with this important problem. Further research on the choice of a clinically meaningful metric for quantifying the prediction error for treatment selections is warranted.
Acknowledgement
This research was supported by the U.S. National Institutes of Health and the National Cancer Institute. We thank the editor, associate editor and two referees for their valuable and constructive comments.
Supplementary material
Supplementary material available at Biometrika online includes a more detailed account of the dynamic programming algorithm and the asymptotic properties of the optimal stratification scheme.
References