Evaluating biomarkers for treatment selection from reproducibility studies

We consider evaluating new or more accurately measured predictive biomarkers for treatment selection based on a previous clinical trial involving standard biomarkers. Instead of rerunning the clinical trial with the new biomarkers, we propose a more efﬁcient approach which requires only either conducting a reproducibility study in which the new biomarkers and standard biomarkers are both measured on a set of patient samples, or adopting replicated measures of the error-contaminated standard biomarkers in the original study. This approach is easier to conduct and much less expensive than studies that require new samples from patients randomized to the intervention. In addition, it makes it possible to perform the estimation of the clinical performance quickly, since there will be no requirement to wait for events to occur as would be the case with prospective validation. The treatment selection is assessed via a working model, but the proposed estimator of the mean restricted lifetime is valid even if the working model is misspeciﬁed. The proposed approach is assessed through simulation studies and applied to a cancer study.


INTRODUCTION
A biomarker used to predict the response to a treatment is called a predictive biomarker. For example, patients with colon cancer can be treated by surgery alone or surgery plus chemotherapy. Surgery alone is less expensive and has fewer side effects than surgery plus chemotherapy, but it may be less effective as well, at least for some patients. For an individual patient, it is desirable to identify whether or not the patient will benefit more from the extra chemotherapy based on a biomarker or a set of biomarkers. A possible useful biomarker in this context is the c-myc gene, which is over-expressed in approximately 70 percent of human colonic tumors. Based on a study conducted by the Eastern Cooperative Oncology Group (ECOG), Augenlicht and others (1997) suggested that the c-myc gene may be of clinically prognostic importance in patients with colon cancer. Using a subset of the cases from this clinical trial, Li and Ryan (2006) found that there is an interaction between the c-myc gene expression levels and the two treatments for the response of disease progression-free survival; Song and Zhou (2011) investigated using the observed c-myc gene expression level for treatment selection. To evaluate the potential of biomarkers for treatment selection, approaches have been proposed that try to minimize the population event rate under (optimal) treatment selection criteria (Song and Pepe, 2004;Brinkley and others, 2010;Cai and others, 2011;Zhang and others, 2012;others, 2011, 2014) or maximize the population mean (restricted lifetime) (Song and Zhou, 2011).
However, biomarker measurements may contain measurement error, or the current biomarkers may not be very effective for treatment selection. In the ECOG study, the c-myc gene expression levels were measured with error (Li and Ryan, 2006). It is of interest to evaluate the amount of gain that could be achieved with respect to treatment selection if the biomarkers were accurately measured. In another aspect, a new technology may improve the measurement of biomarkers or a new biomarker may be identified with better predictability capacity. For example, with the development of polymerase chain reaction technique, the measurement of c-myc gene expression level may become more accurate. If we call the measurement of c-myc gene level in the original study the standard biomarker, and the improved measurement using advanced techniques the new biomarker, then we would like to assess the capacity of the new biomarker for making treatment selections. With a slight abuse of terminology, we refer both newly identified biomarkers and more accurately measured biomarkers as new biomarkers. Ideally, one would like to re-run the previous study, or perform a new study very similar to it, so that one could measure the new biomarkers using prospectively collected samples. But this is often not feasible. Such a procedure would entail large additional costs associated with obtaining samples of tumor tissues with measurement of disease progression-free survival through a multi-year randomized clinical trial. Moreover, it may not even be feasible to perform such a study. We propose a more efficient approach which requires only conducting a reproducibility study where the new biomarkers and the standard biomarkers will be measured on a set of patient samples or replicated measurements of the inaccurately measured standard biomarkers. Importantly, there is no need to re-run the clinical trial. This makes the study easier to conduct and much less expensive. In addition, it will make it possible to perform the estimation of the clinical performance of the new biomarkers quickly, since there will be no requirement to wait for events to occur. The idea of our approach is summarized in Figure 1. The outcome and the standard biomarkers are observed in the clinical trial, while the new biomarkers and the standard biomarkers are observed in the reproducibility study. We make inference on the new biomarkers versus the outcome to assess the capacity of the new biomarkers on treatment selection. There are related studies in the literature (Boostra andothers, 2013a, 2013b) aiming to predict outcome with new biomarkers observed on a subset in the original study, but their objectives are different from what is considered in this article.
In this article, we consider the setting in which the outcome is time to an event of interest (survival time), which may be subject to right censoring. To characterize the inter-relations between the biomarkers and the treatment arms, we adopt a working proportional hazards model with interactions between the treatment and the biomarkers. The relation between the standard markers and the new markers is modeled through a classical measurement error model. Various approaches have been proposed to estimate the regression coefficients in the presence of covariate measurement error under the proportional hazards model, however, most were derived under the assumption of linear covariate effects. These include regression calibration (Prentice, 1982;Dafni and Tsiatis, 1998;Wang and others, 2000), SIMEX (Greene and Cai, 2004), likelihood based approaches (Wulfsohn and Tsiatis, 1997;Faucett and Thomas, 1996;Henderson and others, 2000;Xu and Zeger, 2001;Song and others, 2002b), conditional score (Tsiatis and Davidian, 2001;Song and others, 2002a), and correction approaches (Huang and Wang, 2000), among others. Here, we extend the conditional score approach to the proportional hazards model with interactions. In addition, we propose to use the mean restricted lifetime to evaluate the performance of the predictive biomarkers and derive the optimal treatment selection strategy under the working model. To estimate the mean restricted lifetime, we propose a SIMEX estimator and establish the asymptotic properties using the empirical process and stochastic integral techniques.
The novelty of this article includes the following aspects. First, our idea of evaluating new biomarkers without re-running the clinical trial is novel, which could greatly reduce the study time and cost. Second, the adoption of the measurement error model and techniques under this circumstance is novel. Third, we propose well-justified resampling-based inference which extends the technique of Peng and Huang (2008). To the best of our knowledge, the overlay of the resampling-based inference with the already resampling-based SIMEX approach is new.
The article is organized as follows. In Section 1, we give the model definition. We derive an empirical estimator for the optimal treatment selection and propose an approach to evaluate and compare the new biomarkers and standard biomarkers on treatment selection in Section 2. We investigate the finite sample performance of the proposed approach in Section 3, and we apply the approach to the ECOG data in Section 4. Some discussions are given in Section 5. The regularity conditions and sketched proofs are given in the supplementary material available at Biostatistics online.

MODEL DEFINITION
Let T denote the survival time, and C denote the censoring time. The observed survival data are V = min(T , C) and = I (T ≤ C) , where I (·) is the indicator function. Let Z denote a vector of K continuous standard biomarkers, and X denote a vector of K continuous new biomarkers. Remember that, with a slight abuse of terminology, we refer both newly identified biomarkers and more accurately measured biomarkers as new biomarkers. Let A denote the treatment, where A = 0 denotes the control or standard treatment, and A = 1 denotes the new treatment. Suppose a randomized clinical trial has been conducted to evaluate the standard biomarkers for treatment selection with the observed data We are interested in evaluating the treatment selection capacity of the new biomarkers X .
For an individual with the new biomarkers X , intuitively, we may assign the subject to treatment A = 1 if This is an extension of the treatment rule for binary outcomes where the probability of success is compared for the two treatments (Janes and others, 2014). However, when censoring exists, the mean (unrestricted) survival time may not be estimable if the largest observed survival time is censored without some tail correction on the estimated survival function (Klein and Moeschberger, 2003). Alternatively, we consider the mean restricted survival time (lifetime) up to a given time L. The technique of restricting survival time has been used previously in estimating the mean lifetime and quality-adjusted lifetime (Zhao and Tsiatis, 1997;Chen and Tsitais, 2001) . Specifically, Let T * = min(T , L) be the restricted survival time, and then the optimal treatment A * opt (X ) = I (D * (X ) > 0); that is, if D * (X ) > 0, select A = 1; otherwise, select A = 0. The capacity of treatment selection based on A * opt (X ) can be evaluated by the population mean restricted lifetime under the optimal treatment selection, that is, (2.1) Without loss of generality, we assume an additive measurement error model where e ∼ N (0, ), and e is independent of (X , A, T , C). This is a natural model when the new biomarkers are obtained by improving the accuracy of measurement, but may represent a more general relationship between standard biomarkers and "true" new biomarkers. For example, if we have standard biomarkers Z * and new biomarkers X * and there exist functions g 1 and g 2 , which could be vector valued, such that it reduces to model (2.2) with Z = g 1 (Z * ) and X = g 2 (X * ). For simplicity of presentation, we assume that g 1 and g 2 are known. When g 1 and g 2 are unknown, the relationship between the standard biomarkers and the new biomarkers can be estimated as discussed in Section 6.
To ensure the identifiability of model (2.2), we need to have either validation data or replicated data on Z. We consider three cases. In case 1, a validation data set is available from an external reproducibility study. The observations in the reproducibility study are {(X # i , Z # i ) : i = 1, . . . , m}. In case 2, an internal validation data set of a size m is available in the original data set. Although we may directly evaluate the new marker using the validation set in this case, it would be more efficient to use the whole data set. In case 3, replicated error-contaminated observations are available on some subjects in the original study. Case 3 is only feasible when the new biomarkers are obtained by improving the accuracy of measurement while cases 1 and 2 also cover the situation when the new biomarkers are truly different variables. To unify the notations in the three cases, the observed data in the original study is denoted by where J i denotes the number of replicates for subject i, which always equals one in cases 1 and 2; R i = 1 for a subset of i ∈ ϒ ⊂ {1, . . . , n}. The set ϒ contains m elements for case 2, while it is empty for cases 1 and 3.

Estimation of the optimal treatment
To estimate X , we adopt a working model, which assumes the survival time depends on the new biomarkers X and the treatment A through a proportional hazards model where λ 0 (t) is an unspecified baseline hazard function, and β = (β T 1 , β 2 , β T 3 ) T are the regression parameters. Extension to more flexible survival models is discussed in Section 6. Under model (3.1), we have In fact, it can be easily seen that A opt (X ) = A * opt (X ) under model (3.1); that is, the optimal treatment based on the mean restricted survival time equals the optimal treatment based on the mean unrestricted survival time.
If X were observed, an ideal estimatorβ I of β could be obtained by the standard partial likelihood approach, and the ideal estimator of A opt (X ) isÂ I opt (X ) = A(β I , X ). Here and henceforth, we use the superscript "I " to denote the ideal approach. Since X is not observed in the original study or only observed in a subset ϒ, we may estimate β through measurement error approaches.
We adopt the conditional score approach (Song and others, 2002a) for cases 1 and 3 as it is simple to compute. The conditional score estimator was originally derived for the proportional hazards model without interactions. Here we extend it to model (3.1). Specifically, assume is known for now. Following similar arguments as those in Song and others (2002b), we may obtain the "complete sufficient statistic" The conditional score estimating equation can be written as with a ⊗r = 1, a, aa T for a vector a and r = 0, 1, 2, respectively; N i (t) = I (V i ≤ t, i = 1) is the counting process for the events, and Y i (t) = I (V i ≥ t) is the at-risk process. The error variance may be estimated by the method of moments estimatorˆ from the validation data or the replicated data (Song and others, 2002a). In case 2, X is observed in a subset ϒ. The partial likelihood estimatorβ ϒ of β may be obtained using observations in ϒ only. But this approach is not efficient as the information not in ϒ is not used. To improve the efficiency, following Wang and Song (2016), an improved estimator can be obtained. Specifically, it is the best linear combination ofβ ϒ andβ ϒ,N , which equalsβ ϒ −ˆ (β ϒ,N −β N ), wherê β ϒ,N andβ N are the naive estimates of β obtained by substituting W for X using the observations in ϒ and the whole data set, respectively, andˆ is given in Appendix A in the supplementary material available on Biostatistics online. For simplicity, both this estimator and the conditional score estimator are referred to as error-corrected estimators henceforth.
Denote the error-corrected estimators of β byβ C . The optimal treatment can be estimated byÂ C opt (X ) = A(β C , X ). Here and henceforth, we use the superscript "C" to denote the error-corrected approach. For now, assume model (3.1) holds.
PROPOSITION 1 Under the conditions C1-C4 given in Appendix B available on Biostatistics online, almost surely,β C exists and converges to β. In addition, n 1/2 (β C −β) converges to a mean zero normal distribution.
If model (3.1) is not the true model, it is used as a working model to obtain the treatment selection criterion. Fine (2002) showed that even if model (3.1) does not hold, the partial likelihood estimatorβ I still converges to some constant β * = (β * T 1 , β * 2 , β * T 3 ). It can be shown that the error-corrected estimators are consistent estimators of β * . Given the value of X , A(β I , X ) will converge to a valid treatment selection criterion A(β * , X ), which is equal to A opt (X ) when model (3.1) is correctly specified. When model (3.1) is misspecified, A(β * , X ) is the optimal treatment under (3.1), but may not equal to the optimal treatment A opt (X ). We consider evaluating A(β * , X ) and propose an empirical estimator of X (A(β * , X )).

Estimation of X
With some algebra, it can be shown that .
Then an estimator of X (A(β * , X )) could bê However, X is not observed in the original study or only observed in a subset ϒ. To deal with the measurement error, we may apply the SIMEX approach (Carroll and others, 2006). Assuming is known for now, for an increasing sequence of value of ζ starting from 0, for example, ζ = 0, 0.25, 0.5, . . . , 2, and Extrapolate (ζ ,ˆ ζ ) to ζ = −1 to get the SIMEX estimatorˆ C X . A regression model is usually adopted for the extrapolation, such as the quadratic and nonlinear (rational linear) extrapolation (Carroll and others, 2006). Specifically, suppose thatˆ ζ = h(θ, ζ ) + ζ , where ζ is an error term with mean 0. Letθ be the least square estimator of θ . Then the SIMEX estimator can be written aŝ When is not known, it can be replaced by the estimatorˆ obtained from the validation data or the replicated data. Specifically, PROPOSITION 2 Under the regularity conditions C1-C3 given in Appendix B available on Biostatistics online,ˆ I X is a consistent estimator of X (A(β * , X )) and n 1/2 (ˆ I X − X (A(β * , X ))) converges to a mean zero normal distribution. Further, with the additional conditions C4-C6,ˆ C X is a consistent estimator of X (A(β * , X )) and n 1/2 (ˆ C X − X (A(β * , X ))) converges to a mean zero normal distribution.
Proposition 2 indicates that even if the model (3.1) is misspecified, the empirical estimatorˆ C X is still a valid estimator of the mean restricted lifetime under the treatment selection criterion constructed based on the working model.
Model-based estimation. If the working model is the true model, a model-based estimator can be obtained. Noting that whereˆ I 0 (t) is the Breslow estimator of 0 (t). This cannot be applied directly since X is not observed in the clinical trial. By analogy to the empirical estimator, the SIMEX approach can be adopted to estimate X when X is not observed.
If model (3.1) is misspecified, it can be shown that the model-based estimatorˆ * m,X actually estimate , which is different from the optimal mean restricted lifetime X (A(β * , X )) under the working model. For example, under the scenario considered in our simulation studies in Section 4, if the true model is λ(t|X , A) = λ 0 (t) exp(−2.5X − 0.5X 2 + 0.2A + 5.0XA + X 2 A), the optimal mean restricted lifetime is 6.82, the optimal mean restricted lifetime is 6.79 under working model (3.1), while mis =6.60. Therefore the model-based estimator may not work well for estimation of the mean restricted lifetime under the working model in this case, while the empirical estimator is still a valid estimator.
We will focus on the empirical estimation for its robustness. This approach may be applied to the standard biomarkers as described in Section 3.4, which will facilitate the comparison of the standard biomarkers and the new biomarkers.

Marker-independent treatment selection
Marker-independent treatment selection would assign all subjects to one treatment A = a (a = 0 or 1) if it has been shown significantly better than the other treatment without taking into consideration of their marker values, for example, through the log-rank test. The corresponding mean restricted lifetime is E(T * |A = a), which can be estimated by is the Nelson-Aalen estimator of the cumulative hazard function (t|A = a) given A = a.

Compare the standard and new biomarkers
To compare the new biomarkers to the standard biomarkers, we need to estimate the capacity of the standard biomarkers Z on treatment selection. This can be evaluated by a working model that replaces X by Z in (3.1), that is, This approach is called the naive approach under the literature of measurement error models. Letβ N be the partial likelihood estimator based on the working model (3.7), where the superscript "N " is used to denote naive estimators. It can be shown thatβ N converges to some constant β # (Fine, 2002). Given Z, the optimal treatment selection criterion under the working model is Z = A(β # , Z) = I (β # 2 + β #T 3 Z < 0) and can be estimated by A(β N , Z). The mean restricted lifetime E(T * |A = A(β # , Z)) can be estimated empirically byˆ N Z =Ê e,Z (T * |A = A(β N , Z); Z), which is obtained by substituting Z for X andβ N forβ I in (3.4).
Another way to evaluate a treatment selection rule A is to evaluate the probabilities of subjects misassigned to the non-optimal treatment. If the optimal treatment is 1, the misassignment probability is P 0 (A) = P(A = 0 and A opt = 1), (3.8) and if the optimal treatment is 0, the misassignment probability is P 1 (A) = P(A = 1 and A opt = 0). (3.9) The overall misclassification probability P(A) = P 0 (A) + P 1 (A) = P(A = A opt ). Thus, we may compare the treatment selection rules A(β * , X ) and A(β # , Z) based on (3.8) and (3.9).
In practice, X is not observed in the original study or only observed in a subset ϒ. If (3.1) is correctly specified, the overall misassignment probability of using Z equals if X were observed. Since X is not observed, we may apply the SIMEX approach to obtain the estimate of P(A(β, X ) = 1).
When the original data contain replicated observations, Z may be replaced by the meanZ of the replicated observations, which has better performance than Z with reduced measurement error, and we may compare the treatment selection usingZ vs. X .

Resampling-based inference
Since the asymptotic variance for the empirical estimator depends on the unknown density and hazard functions, the estimation requires smoothing and may not work well when n is not large. We develop a resampling-based approach by analogy to that used in Peng and Huang (2008). We describe how to derive the variance estimator forˆ C X in case 3; the process is similar in the other two cases. Specifically, we generate {η i : i = 1, . . . , n} from a known nonnegative distribution with mean 1. Using η i as weights in the method of moment estimating equation, we first obtain the perturbed estimatorˆ P of , where the superscript "P" stands for the perturbed estimator. Then using η i as weights and replacing byˆ P in the conditional score estimating equation (3.2), we obtain the perturbed estimatorβ C,P of β. Next, we obtain the perturbed estimatorˆ C,P e,X through the perturbed SIMEX process where for each ζ and b, replace By repeatedly generating {η i : i = 1, . . . , n}, we obtain a large number of realization ofˆ C,P X , denoted by {ˆ C,P X ,r } R r=1 . It can be shown that conditional on the observed data, n 1/2 {ˆ C,P X ,r −ˆ C X } has asymptotically the same distribution as n 1/2 {ˆ C X − X }. Thus the variance ofˆ C X can be estimated by the sample variance of {ˆ C,P X ,r } R r=1 , and the confidence interval of X can be constructed through Wald method or by the percentiles ofˆ C,P X ,r . When the error variance is estimated from the validation data in case 1, the perturbed estimator ofˆ P of is obtained from the validation data with a separately generated set of perturbation variables {η R j : j = 1, . . . , m}.

SIMULATION STUDIES
We conducted simulation studies to evaluate the performance of the proposed approaches. Mimicking the case of c-myc gene in the ECOG study, we consider treatment selection using one biomarker. The new biomarker X was generated from a standard normal distribution, and the measurement error was generated from a normal distribution with mean 0 and variance = 0.1, 0.3, 0.5, or 0.7. The survival data contains observations of (V , , A, Z) on n = 500 or 2000 subjects, where A was generated from a Bernoulli distribution with probability 0.5, mimicking treatment assignment in a randomized clinical trial. The survival time was generated based on model (3.1) with β 1 = 2.5, β 2 = −0.2, β 3 = −5.0, and λ 0 (t) = 0.4. The optimal mean restricted lifetime under A opt equals 6.93. The censoring time was generated from an exponential distribution with mean 20 and truncated at 12. The censoring rate was 30%. We considered three cases: (i) the error variance is estimated from an external validation data set that contains observations of (X , Z) on 500 subjects; (ii) the error variance is estimated from replicated observations of Z in the original data with J i = 2 for i = 1, . . . , n; (iii) an internal validation data set is available with m = 0.2n. For each setting, 500 simulated data sets were generated.
We obtained the error-corrected estimators and naive estimators of β, A opt , and E(T * |A = A opt ) as described in Section 3. As a benchmark, we also obtained the ideal estimators of β, A opt , and E(T * |A = A opt ), assuming X is observed in the survival data. When the error is large, the root to the conditional score estimating equation U n (β) = 0 may not exist, and there may exist some "outliers." We calculate the bias based on the median of the estimates, and the standard error by the normalized median absolute deviation (MAD) via the resampling method. The 95% Wald confidence intervals were calculated correspondingly. For the SIMEX approaches, we adopted the rational linear extrapolation, and used the quadratic extrapolation as a backup if the rational extrapolation failed (Carroll and others, 2006). The perturbation variable η was generated from a location-scale transformation of beta( √ 2 − 1, 1) and we set R = 500 in the resampling process and B = 500 in the SIMEX process.
The results of estimating β for n = 500 and 2000 are shown in Tables S1 and S2 in the supplementary material available at Biostatistics online, respectively. Compared to the ideal estimates, the naive estimates show clear bias with the coverage probabilities well below the nominal level; the coverage probabilities worsen when the sample size increases and error variance gets larger. The error-corrected estimates perform reasonably well for n = 500, and the performance improves when the sample size increases to n = 2000.
We also estimated the misassignment probabilities using the standard biomarker and the new biomarker for A opt = 0 and 1 separately. The misassignment probabilities would be zero if the treatment assignment was based on the true model with the true regression coefficients. The estimated misassignment probabilities are shown in Table 1. The estimated misassignment probabilities of the error-corrected estimates based on the new marker are close to the ideal estimates, and they are much lower than those from the naive approach based on the standard biomarker. When n = 2000 and error standard variance increases from 0.1 to 0.7, the estimated total misassignment rate increases from 0.49% to 2.03% in case 1, 0.64% to 0.84% in case 2, and 0.44% to 0.93% in case 2 from the error-corrected approaches based on the new biomarker, while from 9.76% to 22.21% in case 1, 9.76% to 22.14% in case 2, and 6.96% to 17% in case 3 from the naive approach based on the standard marker. This shows the advantage of adopting the new biomarker, especially when the measurement error is large.
We further estimated the mean restricted lifetime within 10 years using the new biomarker and the standard biomarker. The results are shown in Table 2. The optimal mean restricted lifetime is 6.93 using the new marker, whereas when 1/2 increases from 0.1 to 0.7, the mean restricted lifetime based on the standard marker decreases from 6.72 to 6.02 in cases 1 and 2 and 6.82 to 6.35 in case 3. The estimates perform reasonably well for n = 500 and improves when n = 2000.
Under this setting, the average survival time for treatment A = 1 is significantly longer than A = 0 without adjusted for the biomarker value. Therefore, marker-independent treatment selection will assign all subjects to treatment 1. The corresponding mean restricted lifetime within 10 years is 3.984, which is much shorter than marker based estimates. The corresponding misassignment rate is 48.4%, which is much higher than marker based estimates.
We also consider a simulation setting when the true model is misspecified as (3.1). The simulation setting is the same as case 1 above except that the true model is λ(t|X , A) = λ 0 (t) exp(−2.5X − 0.5X 2 + 0.2A + 5.0XA+X 2 A) with λ 0 (t) = 0.4. In this case, the treatment selection based on the working model (3.1) is not optimal overall. Based on the working models (3.1) and (3.7), we estimated the misassignment probabilities and the mean restricted lifetime using the new biomarker and the standard biomarker, respectively. The results for n = 2000 are shown in Tables S3 and S4 in the supplementary material available at Biostatistics online. For estimation of the overall misassignment rate, the estimate is 3.53% from the ideal approach. When the error variance increases from 0.1 to 0.7, the estimates from the error-corrected approach based on the new biomarker are close to that from the ideal approach, while the estimates based on the standard biomarker increase considerably from 10.21% to 22.53% for the naive approach with the error variance increases from 0.1 to 0.7. The optimal mean restricted lifetime based on the new marker is 6.79 under model (3.1), while the mean restricted lifetime based on the standard marker decreases from 6.59 to 5.92 when increases from 0.1 to 0.7. The estimates of the mean restricted lifetime perform well.

APPLICATION
We applied the proposed approach to a subset of the ECOG clinical trial, which was analyzed in Li and Ryan (2006), where c-myc expression level was measured via dot plots on 92 patients randomized to receive surgery alone or surgery plus chemotherapy, both progression-free survival and overall survival were recorded. The results for overall survival are presented as follows while the results for progression-free survival are presented in the supplementary material available at Biostatistics online. Based on the log-rank test, surgery plus chemotherapy is better than surgery alone for overall survival (p-value = 0.0186). Thus marker-independent treatment selection would assign all patients to surgery plus chemotherapy. Using this subset of data, Li and Ryan (2006) found marginally significant interaction between treatment A = I (surgery plus chemotherapy) and log c-myc gene expression level X for progression-free survival (p-value = 0.073) and overall survival (p-value = 0.112) under model (3.1), and Song and Zhou (2011) evaluated bisecting the observed c-myc gene expression level Z for treatment selection. It is conjectured that a more accurate measurement of c-myc gene expression level might improve its capacity for treatment selection. In this subset, 26 subjects have replicated c-myc gene expression measurements with the estimated measurement error standard deviation equal to 0.20. The plot of the residuals Z ij −Z i and the corresponding Q-Q plot indicate that it is reasonable to assume the error is normal with constant variance (Figure 2). We checked the proportional hazards assumption in model (3.7) using the method in Therneau and Grambsch (2006, Chapter 6.2). To check the proportional hazards assumption in model (3.1), as X is not  available, we adopted SIMEX approach to obtain the error-corrected p-values. There is no evidence of violation of the proportional hazards assumptions in both models. We would like to assess the amount of gain of improving c-myc gene expression level measurement on treatment selection and to compare treatment selection based on the true value X and the errorcontaminated observation. To reduce variation from SIMEX and bootstrap, we took larger values of R = 5000 in the resampling process and B = 5000 in the SIMEX process. Under the working proportional hazards model (3.1) and (3.7), the estimated coefficients and standard errors are shown in Table 3. The error-corrected estimates of log c-myc level and the interaction are larger in magnitude. Among the 92 subjects, 48 (47.8%) patients were assigned to surgery alone and 44 to surgery plus chemotherapy. An estimate of 87% patients would be assigned to surgery plus chemotherapy if the treatment is selected using Z, and 82% if the treatment is selected using X . This suggested that 10% patients might be assigned to the wrong treatment if model (3.1) is true and the treatment is selected using Z. The estimated mean restricted lifetime within 5 years is 3.85 if all patients are assigned to surgery alone, and 4.10 if all are assigned to surgery plus chemotherapy. The estimated mean restricted lifetime within 5 years is 4.40 when the treatment is selected using X while 4.13 when the treatment is selected using Z. There is no significant difference between the mean restricted life times (95% confidence interval −0.07, 0.61). With the consideration of the possible side effects of chemotherapy, the treatment selection based on the new marker seems to be better as less patients are assigned to surgery plus chemotherapy.

DISCUSSION
We have proposed a novel method to evaluate a new biomarker based on data from a reproducibility study. Our approach assumes that the reproducibility study samples and the clinical trial samples are randomly drawn from a common target population. It is beyond the scope of this article to address the impact of violations of this assumption.
We have aimed on maximizing the population mean-restricted lifetime. This method can be extended to other statistical measures, for example, L-year survival rate, or a more flexible utility function that incorporates notions of cost and quality of life. Although we have focused on survival time as outcomes, the approach can be adapted to discrete and continuous outcomes with minor modifications.
For simplicity, we have adopted proportional hazards models as working models. The estimation of mean restricted lifetime under the treatment selection criterion based on the working model is still valid even if the model is misspecified. Our approaches may also be extended to more flexible models, for example, by allowing time-varying treatment or covariate effects, or including nonparametric functions of covariates in the survival model. Other types of survival models, such as the accelerated failure time model or the additive hazards model, may also be used. Such extensions may warrant further investigation.
In addition, we have assumed a classical measurement error model between the new marker and the standard marker. An more flexible model such as (2.3) with g 1 and g 2 unknown may be adopted and could be estimated using data from the reproducibility study with validation data based on parametric models or spline approximation. In some situations, both markers might involve measurement error while the measurement error for the new marker might be smaller. It would be of interest to extend our approaches to accommodate such complexity.
The model of equation (2.2) assumes that the new assay reduces error compared to the existing assay. This is motivated by the fact that often newer technologies provide better measurements than existing technologies. However, the new technologies are often more expensive as well. There is a cost-benefit tradeoff for many new assays. Our modeling approach can help in the quantification of how large the potential benefit might be from a new assay that is under consideration. This can be used to make the "go" versus "no go" decision on whether or not to switch to the new marker in the future. In other words, by quantifying the benefit, our method gives the information needed to make this decision.Another application of our approach is to the setting where a gold standard treatment-selection biomarker exists along with an approximation to the biomarker. The approximation may be based for example on pathological evaluations, like the Magee score as an approximation to the OncotypeDX score (Farrugia and others, 2017). Our approach provides a framework in which to understand the relationship between two such scores as predictors of a clinically important survival outcome.

SOFTWARE
The R code and a sample data set are available on GitHub (https://github.com/xsong88/Evaluate-Biomarkers).