SUMMARY

An intermediate response measure that accurately predicts efficacy in a new setting at the individual level could be used both for prediction and personalized medical decisions. In this article, we define a predictive individual-level general surrogate (PIGS), which is an individual-level intermediate response that can be used to accurately predict individual efficacy in a new setting. While methods for evaluating trial-level general surrogates, which are predictors of trial-level efficacy, have been developed previously, few, if any, methods have been developed to evaluate individual-level general surrogates, and no methods have formalized the use of cross-validation to quantify the expected prediction error. Our proposed method uses existing methods of individual-level surrogate evaluation within a given clinical trial setting in combination with cross-validation over a set of clinical trials to evaluate surrogate quality and to estimate the absolute prediction error that is expected in a new trial setting when using a PIGS. Simulations show that our method performs well across a variety of scenarios. We use our method to evaluate and to compare candidate individual-level general surrogates over a set of multi-national trials of a pentavalent rotavirus vaccine.

1. Introduction

A surrogate outcome that could be used to accurately predict the treatment effect for a given individual in a new setting would be advantageous. Given the many definitions of surrogacy in the literature, the task of simply and precisely defining such a surrogate is difficult. As outlined by Joffe and Greene (2009), two main surrogate evaluation paradigms are the causal association paradigm and causal effects paradigm. Under the causal association paradigm, causal treatment effects on the surrogate are used to predict causal treatment effects on the clinical outcome. Under the causal effects paradigm, both the treatment effect on the surrogate and the association of the surrogate with the clinical outcome are used to predict the clinical outcome. The two frameworks outlined by Joffe and Greene (2009) under the causal association paradigm are the principal stratification and the meta-analysis frameworks. The two frameworks outlined in Joffe and Greene (2009) under the causal effects paradigm are the Prentice criteria (Prentice, 1989) and the direct and indirect effects frameworks (Robins and Greenland, 1992). Given the predictive nature of our desired surrogate, we will focus on the causal association paradigm for our proposed evaluation method.

The meta-analysis and principal stratification frameworks differ primarily in two ways. The first is in their level of prediction, trial, or individual. Trial-level surrogates are predictors of trial-level treatment effects on the clinical outcome while individual-level surrogates are predictors of clinical outcomes and treatment effects for individuals. The principal stratification framework was introduced by Frangakis and Rubin (2002) as a way to evaluate individual treatment effects within sub-groups stratified by post-randomization variables without introducing confounding, making surrogates evaluated with principal stratification methods individual-level surrogates. The vast majority of meta-analysis methods have been developed to evaluate trial-level surrogates under the causal association paradigm; to our knowledge all existing methods of individual-level evaluation via meta-analysis are evaluated under the causal effects paradigm (Burzykowski and others, 2001; Buyse and others, 2000). Given our aim of obtaining individual-level predictions, we propose to use the principal stratification framework and evaluate the individual-level treatment effects on the candidate surrogate.

The second difference between the existing meta-analysis and principal stratification methods is the settings in which an evaluated surrogate is considered to be accurate or useful as a predictor. If predictions based on a given surrogate are believed to be accurate only within the specific setting of evaluation, it is referred to as a specific surrogate. Surrogates evaluated under the principal stratification framework are almost always specific, as usually only one clinical trial setting is used for evaluation. If a surrogate is believed to be an accurate predictor in a new setting, beyond the evaluation setting, it is often referred to as a general surrogate (Qin and others, 2007; Gilbert and others, 2008). Surrogate evaluation under meta-analysis methods, in most cases, is intended for general surrogates, although the generalizability of the resulting surrogate is limited by the diversity of the settings used for evaluation. Based on these three sets of classifications, paradigm, level, and generalizability, we propose to call a surrogate evaluated under the casual association paradigm and the principal stratification framework that can accurately predict individual-level efficacy in a new setting a predictive individual-level general surrogate (PIGS); a precise definition is given in Section 3.

We propose a method that uses individual-level data across multiple trials to evaluate an individual-level surrogate under the causal association paradigm and the principal stratification framework. Two articles proposed principal stratification methods for use over a set of trials (Li and others, 2011; Jiang and others, 2016). Both articles suggested the use of Bayesian hierarchical models to investigate binary surrogates for a binary clinical outcome and both articles apply their methods to colon cancer data sets. However, neither Li and others (2011) nor Jiang and others (2016) extended the concept of individual-level surrogacy for prediction in a new setting, which is a central focus this work. We propose to use existing methods of specific surrogate evaluation to assess the relationship between an arbitrary clinical outcome (i.e. continuous, count, time-to-event) and a continuous candidate surrogate, by estimating a contrast curve as a function of the candidate surrogate.

This is known as the causal predictiveness curve approach to surrogate evaluation and was introduced in Gilbert and Hudgens (2008) for specific surrogate evaluation under the principal stratification framework. Our suggested method relies on augmented trial designs to obtain information about the missing counterfactual surrogate values and estimated maximum likelihood (EML) estimation in settings where missing counterfactuals persist. Importantly, our proposed method offers an advancement over all existing methods, to the knowledge of the authors, by quantifying the prediction error that one can expect when using a PIGS for individual-level prediction in a new setting. Gilbert and Hudgens (2008) and Gilbert and others (2011) suggested that something like our proposed method could be used to evaluate surrogates, as well Frangakis and Rubin (2002) gave statistical generalizability as a proposed property of a principal surrogate, but to the knowledge of the authors, a similar method that allows for the investigation of out-of-sample prediction error for an individual-level surrogate has not been formalized in the literature.

In this article, after outlining the necessary notation and desired risk estimands in Section 2, we introduce the definition, usefulness and criteria for a PIGS in Section 3. In Section 4, we outline our suggested set of assumptions and the trial design of our motivating example. In Section 5, we outline EML estimation for the risk estimands, the testable hypotheses of the proposed PIGS criteria, and the procedure for evaluating the criteria. In Sections 6 and 7 we apply our proposed method to simulated examples and our motivating application.

We use our method to evaluate and compare PIGS for the pentavalent rotavirus vaccine RotaTeqTM (RV5) (Merck & Co., Inc., Kenilworth, NJ USA). RV5 has been shown to be efficacious against rotavirus gastroenteritis (RVGE) in many settings, but in developing countries where the disease burden is highest, the efficacy is lower than in developed nations. Although some evidence of trial-level surrogacy has been found in previous work (Gabriel and others, 2016), a clear measure of post-randomization immune response at the individual level that can be used to predict individual efficacy has not been identified. We show how our approach can be used to evaluate a candidate PIGS in the setting of the RV5 trials.

2. Notation and risk estimands

Following the literature on evaluating specific principal surrogates, we propose to consider the counterfactual potential outcomes of the post-randomization candidate surrogate and the clinical outcome. Suppose there are |$J$| trials indexed by |$j=\{1 \dots J\}$|⁠, each containing |$n_j$| subjects. Let the subjects within a trial be indexed by |$i$|⁠. Let |$Z_{ij}=\{0,1\}$| be the treatment indicator for subject |$i$| in trial |$j$|⁠.

Let |$\textbf{X}_{ij}$| be the observed vector of all baseline covariates measured for subject |$i$| in trial |$j$|⁠, such that |$\textbf{X}_{ij}=\{\textbf{B}_{ij}, \textbf{q}_j\}$| where |$\textbf{B}_{ij}$| are all individual-level baseline variables and |$\textbf{q}_j$| are all trial-level baseline variables, such as the region in which a trial was run. Let |$\textbf{A}_{ij}$| be the vector of all post-randomization measurements for individual |$i$| in trial |$j$| measured before or at |$\tau$| time post-randomization. For post-randomization measures, including the time on trial |$T_{ij}$| and clinical outcome |$Y_{ij}$|⁠, there are potential values under each level of Z, such that each subjects’ potential and observed measures are given by

Although this vector of measures is assumed independently and identically distributed (iid) within a given trial |$j$|⁠, clinical and surrogate outcome values may tend to be more similar within a given trial than between them. Although we denote all interventions by |$Z$|⁠, the interventions need not be the same over the trials. For example, one may wish to evaluate malaria parasite density immediately following treatment as a surrogate for malaria cure over several different types of malaria treatment which have different mechanisms of parasite killing. Provided the same candidate surrogates are measured, at the same relative time, for the same clinical outcome, the treatments need only be towards the same end for our proposed methods to be useful for surrogate evaluation.

Let |$S_{ijk}(z)=f_k(\textbf{A}_{ij}(z))$| be some function of the potential values |$\textbf{A}_{ij}(z)$|⁠, where |$k$| denotes the |$k$|th candidate PIGS of which there are |$K$|⁠. We will suppress dependence on |$k$| when we are only discussing a single candidate. The function |$f_k$| defining the candidate PIGS can be as simple as a single peak drug level measurement in the blood or contain all the post-randomization variables observed in all the trials. When determining which candidate PIGS to consider it is important to keep in mind that all future settings attempting to use this PIGS will need to measure all of the inputs, and that all the observed trials must have the same |$\textbf{A}_{ij}$| available. Realizations of the observed conditional variable |$S_{jk}|Z=z$| for trial |$j$| will be denoted as |$s_{zjk}$|⁠, which is always a vector indexed by |$i$|⁠, the elements of which are denoted as |$s_{zijk}$|⁠.

Using this notation, the joint conditional risk estimands can be defined for a single trial by:
(2.1)
and the marginal conditional risk estimands similarly by:
(2.2)

We also define the risk estimands that only condition on baseline characteristics and not |$S$|⁠: |$risk_{z,j}(\textbf{b})$|⁠. We will focus on the count endpoint hereafter for clarity, however all methods can be easily adapted to the time-to-event setting, the continuous outcome setting, or the binary outcome setting. Methods for estimating either of the conditional risk estimands for arbitrary types of clinical endpoints have previously been developed and, as we will discuss in detail in the later sections, all other methods herein can be applied directly once either the marginal or joint conditional risk estimates have been obtained (Qin and others, 2008; Gilbert and Hudgens, 2008; Wolfson and Gilbert, 2010; Huang and Gilbert, 2011; Huang and others, 2013; Gabriel and others, 2015; Gabriel and Gilbert, 2014).

We are interested in contrasts of the conditional risk estimands between treatment and control. Gilbert and Hudgens (2008) defined a contrast function, |$h(x,y)$|⁠, such that |$h(x ,y)=0$| if and only if |$x =y $| to be a causal effect predictiveness surface (CEP); here |$x$| and |$y$| are replaced with either of the sets of conditional risk estimands. Contrasts of this nature are of interest in surrogate evaluation for the same reason sub-group analysis is of interest for evaluating a baseline biomarker as predictive of a treatment effect. As we are not interested in the association between the candidate surrogate and the clinical outcome, but rather in the causal effect on the candidate surrogate and the causal effect on the clinical outcome, we must consider the contrast of the two randomized groups conditional on the treatment effect on the candidate surrogate. The hope is that there is a large treatment effect on the clinical outcome when there is a large treatment effect on the candidate surrogate, and there is no treatment effect on the clinical outcome when there is no treatment effect on the candidate surrogate. For example, if a subject would have a large difference between the candidate surrogate under placebo, |$S(0)$|⁠, and under treatment, |$S(1)$|⁠, it is hoped that this implies a large treatment effect on the clinical outcome as well.

The CEP lets us evaluate evidence of such an association, by allowing the clinical treatment effect to vary in the candidate surrogate treatment effect. The CEP is a curve or simply a set of comparisons for a continuous and categorical candidate surrogate, respectively, when contrasting the marginal risk estimands. We will focus on the marginal risk estimands for simplicity, but any of the results can be extended to the joint risk estimands. The joint and marginal risk estimands are equivalent when |$S_{ij}(0)=C$| where |$C$| is the same constant for all subjects, as in this case changes in |$S_i(1) - S_i(0)$| and |$S_i(1)$| would be equivalent. The candidate surrogate measurement under placebo |$S_{ij}(0)$| is often constant for all |$i$| in placebo controlled vaccine trials in antigen naive subjects, when antigen recognition in some form is used as the surrogate. For example, in HIV vaccine trials where HIV antigen recognition is considered as a candidate surrogate for the clinical effect on HIV infection, subjects receiving placebo who are HIV negative will have no measurable response.

In recurrent event and treatment settings, it is unlikely that |$S_{ij}(0) = C $| for all subjects |$i$|⁠. However, there are some functional forms of the surrogate, such as the difference in the post-randomization variables and that same variable’s baseline value, which may induce |$S_{ij}(0) =C $| for all |$i$| and |$j$|⁠. This concept is discussed in detail in Gabriel and Gilbert (2014). By using the change from baseline in a biomarker as the candidate surrogate, instead of the post-randomization measurement alone, one can often induce |$S_{ij}(0) =C$| for all |$i$| and |$j$|⁠, making the joint and marginal risk estimands equivalent. We use this procedure in our motivating example, as RVGE of any severity is a recurrent event.

3. PIGS definition, usefulness, and criteria

Predictive individual-level general surrogate (PIGS)—If the individual-level treatment effect on a biomarker or a function of biomarker(s) accurately predicts the clinical treatment effect on a clinical outcome at the individual level and this predictive association is generalizable to a new setting, then that biomarker (or function of biomarker(s)) is a PIGS for that treatment effect in that new setting.

The usefulness of a PIGS is two-fold. First, like a study-specific surrogate, a PIGS can be used for intervention improvement by giving a target, as well as a meaningful endpoint, in early phase trials. However, the assumption that the surrogate can be used in the new setting of interest can be considered more plausible, if not testable, for a PIGS. Second, unlike the specific surrogate, a high quality PIGS can be used in post-licensure settings to help inform individual medical decisions. Some examples of post-licensure surrogates are using hepatitis B antigen responses to determine if prophylactic treatment is needed to prevent acute hepatitis B infection after an accidental needle stick, and using tumor response to determine if an additional therapy should be added to the regimen of a breast cancer patient. Although both of these surrogates are already being used, they have not been evaluated for general use at the individual level, and therefore their use may be misleading. With these goals in mind, we specify the criteria and evaluation procedure for PIGS.

We propose that the existing criteria for a study-specific principal surrogate are also useful for evaluating candidate PIGS because a high quality PIGS should also be a high quality specific surrogate within each evaluation trial setting. Some recent work in principal surrogate evaluation has suggested a single criterion for a biomarker to have any value as a principal surrogate based on the CEP: that it varies greatly in the candidate surrogate. The idea is that provided the clinical treatment effect varies as a function of the treatment effect on the surrogate, a treatment can target that level of the candidate surrogate, and hopefully improve the treatment effect. For example, if we find that the CEP conditional on a high level of immune response suggests a high level of vaccine efficacy, a new formulation of the vaccine can increase the dose of antigen or the number of doses in an attempt to induce higher immune responses and hopefully improve efficacy for more subjects.

This criterion has been referred to as wide effect modification. We define wide effect modification based on the marginal CEP as there exists some |$s^{*}_1 \neq s_1$| such that |$|h(risk_0(s^{*}_{1}), risk_1(s^{*}_{1})) - h(risk_0(s_{1}), risk_1(s_{1}))| >\delta$|⁠, where the larger |$\delta \geq 0$| the better the candidate surrogate; this simply means the CEP is not constant in |$s_1$|⁠.

Frangakis and Rubin (2002) gave a criterion for specific surrogacy, which given in terms of the marginal risk estimands is |$risk_1(s_1)=risk_0(s_1)$| if |$s_1=0$|⁠. This has been called average causal necessity. Gilbert and Hudgens (2008) suggested an additional criteria based on the CEP, calling it average causal sufficiency. Average causal sufficiency is the existence of some level |$C$| such that |$h(risk_0(s_{1}), risk_1(s_{1})) > 0$| for all |$s_{1}>C$|⁠. When |$C = 0$|⁠, this has been referred to as one-sided strong average causal sufficiency. An ideal surrogate would fulfill both of these criteria, but neither is needed for a surrogate to have at least some value as a predictor of efficacy in a particular trial setting.

The above criteria, if they held in all trials, would give us evidence that a candidate surrogate is useful as a predictor within each of the trial settings. However, in addition to the study-specific criteria, there should be consideration of predictive accuracy outside the trials used to fit the risk model for the evaluation of a PIGS. We outline in detail our suggested leave-one-trial-out quantification of predictive accuracy in a new setting in Section 5. Therefore, we propose two criteria for a biomarker to have any value as a PIGS, with a third criterion for an ideal PIGS, which not only has at least some predictive value in a new setting, but also is the ideal surrogate within each of the evaluation trial settings. These criteria are as follows,

PIGS criteria:

  • 1. Wide effect modification in each trial or non-constant CEP in |$s_1$|⁠.

  • 2. Significant predictive value outside of the evaluation set, what we will call out-of-sample predictive value. The risk model conditioning on S gives significantly better prediction of the clinical treatment effect at the individual level than does the model without S, for a setting not used to fit either risk model.

  • 3. Ideal criterion: one-sided strong average causal sufficiency and average causal necessity within each trial, |$h({\rm{risk}}_{0j}(s_{1}), {\rm{risk}}_{1j}(s_{1})) > 0$| for all |$s_{1}>0$| and |${\rm{risk}}_{1j}(s_1)={\rm{risk}}_{0j}(s_1)$| if |$s_1=0$|⁠.

Criterion 1 shows that the candidate surrogate has at least some value as a specific surrogate within each trial setting. Criterion 2 ensures that the within-trial predictive value of the surrogate |$S$| has at least some value in a new setting outside those used to fit the current risk model. Criteria 1 and 2 are the essential characteristics for a PIGS. However, these criteria do not ensure an ideal surrogate. When both parts of criterion 3 hold in all settings, this implies that any effect on the surrogate is the same as an effect on the clinical outcome, although they may differ in magnitude, and that no effect on the surrogate is the same as no effect on the clinical outcome. This is clearly a very useful characteristic, but is not needed for a surrogate to be of use. For example, if, over all the trials, zero effect on the candidate surrogate is associated with a moderate effect on the clinical outcome that increases with increased effect on the surrogate, and this association is predictive in all settings, this is a useful predictive surrogate. This surrogate does not fulfill either part of criterion 3, but can still be used to help improve treatments in new settings and predict individual efficacy for the current treatment. Clearly when all criteria hold, the candidate surrogate is a highly useful PIGS.

The evaluation of wide effect modification has been addressed in numerous papers and has varied from formal hypothesis testing of the parameters of the parametric model assumed for the risk estimands to visual assessment of the CEP curve. When parametric models are used, tests for wide effect modification can be based on the coefficients. Under non-parametric estimation, direct comparison of the CEP under different levels of |$\{S(1), S(0)\}$| can be undertaken. The evaluation of one-sided strong average causal sufficiency is less straightforward; if average causal necessity holds for a parametric model that is linear in |$s_1$|⁠, one-sided strong average causal sufficiency holds if the coefficients of |$s_1$| imply that efficacy is increasing in |$s_1$|⁠. Although there is no way to directly test for average causal necessity, an estimate of zero clinical efficacy when there is zero treatment effect on the candidate surrogate with a tight confidence interval is a strong indication that average causal necessity holds. Testing will vary by the assumed risk model and the CEP used, as methods have been developed in previous work for the evaluation of all of these criteria for a number of clinical outcomes and CEP types, we will not go into further detail here. Please see Follmann (2006), Gilbert and Hudgens (2008), Qin and others (2008), Wolfson and Gilbert (2010), Huang and Gilbert (2011), Huang and others (2013), Gabriel and Gilbert (2014) and Gabriel and others (2015) for details regarding specific clinical endpoints and CEP forms.

3.1. Evaluating out-of-sample predictive accuracy

For ease of notation, let the marginal CEP within trial |$j$|⁠, |$h_j({\rm{risk}}_{0j}(s_{1j}), {\rm{risk}}_{1j}(s_{1j}))$|⁠, be denoted as |$h_j(s_{1j})$| and the value of the CEP for subject |$i$| in trial |$j$| as |$h_{ij}(s_{1ij})$|⁠. Similarly, define the CEP for trial |$j$| based on information from all |$J$| trials as |$h_{j,J}(s_{1j})$| and the CEP value for subject |$i$| in trial |$j$| from it as |$h_{ij,J}(s_{1ij})$|⁠. We can use this same model to make out-of-sample predictions, that is, to predict in a different trial. Leaving out the |$j$|th trial but using all other trial information, we denote this CEP as |$\widehat{h}_{j,J(-j)}(s_{1j})$| and the CEP values from this for subject |$i$| in trial |$j$| as |$\widehat{h}_{ij,J(-j)}(s_{1ij})$|⁠. To evaluate out-of-sample predictive accuracy we want to compare these values to the CEP values based on the |$j$|th trial alone |$h_{ij}(s_{1ij})$|⁠.

We propose a leave-the-|$j$|th-trial-out cross-validation method for evaluating predictive accuracy of the candidate PIGS in a new trial setting. Regardless of the model, we propose the absolute prediction error statistic as the summary for PIGS evaluation and comparison. The absolute prediction error, denoted by |$D$|⁠, is the absolute difference between our estimate for the next subject in an unknown setting and their true value. When the setting from which a new subject is drawn is known, a similar statistic conditioning on that setting, |$D_{j}$|⁠, is our desired statistic. Formally, |$D_{j}$| and |$D$| are given by:
where |$*$| is an unobserved subject in trial |$j$|⁠, or in a new setting |$J+1$|⁠, whose realized value of the candidate PIGS |$s_{1*}$| is also unknown. One estimate of this is what we will call the leave-the-|$j$|th-trial-out estimate:
To evaluate a biomarker as having any out-of-sample predictive value, we need a null model for comparison. Therefore, we similarly define |$D^{0}_j$| and |$D^{0}$| as the absolute prediction error based on risk estimands conditioning on any included baseline covariates, but not including the candidate surrogate, for the leave-the-|$j$|th-trial-out estimates:

For all these estimates, we are comparing the true CEP for subject |$i$| in trial |$j$| to our estimate based on information from the other |$J-1$| trials. However, we can never observe the true |$h_{ij}(s_{1j})$|⁠. We will approximate the desired estimates by substituting |$\widehat{h}_{ij}(s_{1j})$| for the true CEP. These approximations are denoted by using a |$\widetilde{.}$| instead of a |$\widehat{.}$| over the |$D$| estimates. As |$D$| estimation is based on the CEP estimates rather than actual clinical outcome value, any CEP estimation method can be used without adapting the suggested |$D$| estimation methods.

In Gabriel and others (2016) a similar approximation method for |$\widehat{D}$| was suggested, in which the estimate of the true value was based on the most flexible and complete data model. Here, we suggest that the model containing |$S$| is a better fit than the one not containing |$S$| within a given trial |$j$| because wide effect modification holds, from criterion 1, and that this is also the most flexible model, in its class of models, as it allows for trial-specific effects. However, we do not know if the CEP based on all the other trials, including |$S$|⁠, is more similar to this flexible estimate or to the risk models conditioning on baseline variables alone. Just because |$S$| is a high quality surrogate within a trial, does not necessarily imply that |$\widetilde{D}_{j}$| will be significantly smaller than |$\widetilde{D}^{0}_{j}$|⁠.

For example, if the range of |$S_j(1)$| overlaps with the range in several other trials, but the underlying distributions of the |$S_j$|s are different, as well as their associations with the clinical outcome, predictions based on a set of baseline variables may be more accurate for trial |$j$|⁠, even if within trial |$j$| the surrogate, |$S$|⁠, has value as a specific surrogate. We propose to test the null hypotheses |$\widetilde{D}\geq \widetilde{D}^{0}$| for the evaluation of criterion 2. We can then use |$\widetilde{D}$| to rank all candidate surrogates with any value as a PIGS and to give prediction intervals, within which the individual-level treatment effects for a new subject would be expected to lie. The proposed criteria cannot be evaluated using the observable data without additional assumptions.

4. Assumptions

To link the proposed risk estimands to observable data, we must make some assumptions. As outlined in Wolfson and Gilbert (2010), and many other works, the following Assumptions A1–A3 help identify the risk estimands:

  • A1: Stable Unit Treatment Value Assumption (SUTVA), or no interference, and Consistency

  • A2: Ignorable Treatment Assignment

  • A3: Equal individual clinical risk until the measurement of the set of biomarkers |$A$|

Assumption A1 means that subjects’ outcomes are independent of other subjects’ treatment and outcomes and that we can measure subjects’ true counterfactual values, if their randomization allows. Assumption A2 means that the counterfactuals we observe, under A1, are the same types of outcomes that we would have observed had the randomization been realized differently. Assumptions A1 and A2 imply that |$Y(z) = Y|Z=z$| and |$S(z) = S|Z=z$|⁠. Assumption A3 allows us to remove subjects that have an event prior to the measurement of |$S$| from the analysis without causing bias, as their risk of an event is exactly the same prior to that time point. Assumption A3 and the relaxation of it were explored in Wolfson and Gilbert (2010). Using these assumptions we can link the risk estimands to a model for the clinical outcomes based on observed data. However, as we will not observe a subject’s candidate surrogate measurements under both values of |$z$|⁠, we need some way to account for the missing counterfactual data to identify either of the joint and one of the marginal conditional risk estimands.

Follmann (2006) and Gabriel and Follmann (2016) introduced several trial designs which gather additional information and allow for the identification of the marginal, and potentially the joint, conditional risk estimands under Assumptions A1–A3. Follmann (2006) called trials that gather this additional information augmented trials. Greater detail on trial augmentation is given in the supplementary material available at Biostatistics online. The simplest form of augmentation, and the one available in our motivating example, is the measurement of baseline variable(s) which is(are) predictive of the candidate surrogate of interest in the intervention arm. We assume that each trial used for evaluation has at least some type of augmentation, allowing for the identification of the marginal risk estimands. However, not all trials need to have the same augmentation.

5. Estimation models and testable hypotheses

Under Assumptions A1–A3, and some type of augmentation in each trial, we propose to fit a parametric model for risk in each trial separately. The desired model for risk will vary based on the clinical outcome of interest. Assuming |$S(0)$| is equal to some constant for all subjects in all trials, for ease of illustration we will consider a simple risk model for our simulations and motivating example given by:
(5.1)

Here |$g(x) = e^x$|⁠. We assume the distribution to be Poisson so that the risk is the expected value of a Poisson random variable and is log-linear in the variables.

Various methods have been suggested in the literature to estimate the assumed risk model given the large amount of missing data, including full likelihood, EML, pseudoscore, EM algorithm and a number of Bayesian formulations (Follmann, 2006; Gilbert and Hudgens, 2008; Huang and Gilbert, 2011; Huang and others, 2013; Li and others, 2010, 2011. We will use EML estimation, which relies on bootstrap inference due to the lack of a closed form variance in our setting. We outline EML estimation below, assuming that all trials are augmented as in our motivating example with baseline variables that are predictive of the candidate surrogate of interest.

Let |$W_j$| be the baseline variable(s) measured in trial |$j$| that are predictive of |$S_j(1)$|⁠. Using EML with this augmentation, the likelihood contribution for each subject with |$S(1)$| measured, under Assumptions A1–A2, is standard, and the likelihood contribution for those subjects without |$S(1)$| measured is the integral of the standard likelihood over |$\hat{F}_j(S(1)|W_{j})$|⁠. The nuisance term |$F_j(S(1)|W_{j})$| is estimated using those subjects with both |$S(1)$| and |$W$| measured, before fitting the risk model, hence the name EML (Pepe and Fleming, 1991). As we are not assuming that |$S(1)$| has the same distribution over all the trials, we allow for the nuisance term estimation to vary by trial. Therefore |$\hat{F}_j(S(1)|W_{j})$| can depend on |$j$| and different predictors and even augmentation types can be used for the different trials.

For our motivating example and simulations, we will use a CEP of the negative of the log of the risk ratio |$-\log({\rm{risk}}_1 / {\rm{risk}}_0)$|⁠, so that it is linear in |$S(1)$| and CEP|$>0$| is in the direction of positive efficacy. The within-trial testable null hypotheses assuming this risk model and CEP are as follows: wide effect modification, |$\gamma_{j3} = 0$|⁠. One way to ensure one-sided strong average causal sufficiency, is that the CEP is increasing in |$s_1$|⁠, |$\gamma_{j3} <0$| and that when |$S(1)=0$| the CEP is |$\geq 0$|⁠, |$\gamma_{j1} \leq 0$|⁠. However, if |$\gamma_{j1} < 0$| we will reject average causal necessity. To support average causal necessity we suggest the estimate of |$\widehat{\rm{CEP}}(s_j=0)=\widehat{\gamma}_{j1}$| be given along with its confidence interval. Under this assumed model and CEP, all tests for and investigations of the criteria can be undertaken with Wald tests or confidence intervals based on the bootstrap standard errors.

To consider out-of-sample prediction error we can use the same estimation procedure for the |$J*2$| models of interest. The leave-the-|$j$|-th-trial out risk model based on information from all other trials:
(5.2)

We also fit the null model |${\rm{risk}}_{z,J(-j)}=g\{\alpha_{J(-j)0} + \alpha_{(j)1}*Z\}$|⁠.

We wish to obtain |$\widetilde{D}_j = \sum_i |\widehat{\rm{CEP}}_{ij} - \widehat{\rm{CEP}}_{ij,J(-j)}|$|⁠, which given Assumption 2 can be estimated by averaging over the subjects in the treatment arm that have |$S(1)$| measured. We want to compare this prediction error to that of the null model wherein we do not include |$S(1)$|⁠, |$\widetilde{D}^{0}_j$|⁠. For evaluation of candidate PIGS we want to compare |$\widetilde{D}$| to |$\widetilde{D}^{0}$|⁠, where these are just the average of the estimated |$\widetilde{D}_j$| and |$\widetilde{D}^{0}_j$| over the trials. If the candidate PIGS has significant predictive value, |$\widetilde{D}$| will be significantly smaller than |$\widetilde{D}^{0}$|⁠. Inverted non-parametric bootstrap confidence intervals can be used to test for significant differences, as is shown in the simulations.

For comparison of two candidate surrogates, both of which pass the minimum criteria 1 and 2 for a PIGS, the surrogate for which criterion 3 also holds is preferred. Ideally this candidate would also have a smaller |$\widetilde{D}$| than the other candidate PIGS. One can test this in a similar fashion to the test of the null for criterion 2: |$\widetilde{D}^{l} >\widetilde{D}^{k}$|⁠, where |$k$| and |$l$| indicate they are for different candidate PIGS. Again, inverted non-parametric bootstrap standard confidence intervals can be used to test for significant differences. Even when there is no significant difference, the confidence intervals can be used to help identify the superior predictor.

We will demonstrate this concept in the simulated examples and application. The procedure for evaluation and comparison using EML and bootstrap inference is as follows:

  • 1. Fit models for |${\rm{risk}}_{zj}(s_{1j})$| for each of the |$j$| trials, and candidate PIGS.

  • 2. Fit model for |${\rm{risk}}_{zJ(-j)}(s_{1j})$| for each of the |$j$| trials, and candidate PIGS using data from the |$J-1$| other trials, and then use this model to make predictions in trial |$j$|⁠, via 5.2.

  • 3. Calculate |$\widetilde{D}_j$| using estimates from Steps 1 and 2 for each of the candidate PIGS, and use these to calculate a |$\widetilde{D}^{k}$| for each of the |$k$| candidate PIGS.

  • 4. Fit |$risk_{zJ(-j)}(\mathbf{b})$| using data from the |$J-1$| other trials, and then use this model to make predictions in trial |$j.$|

  • 5. Calculate |$\widetilde{D}^{0}_j$| using estimates from step 1 and 4 for each trial, and use these to calculate |$\widetilde{D}^{0}$|⁠.

  • 6. Calculate |$\widetilde{D}^{k}- \widetilde{D}^{0}$| and |$\widetilde{D}^{l} -\widetilde{D}^{k}$| for all |$K$| candidate PIGS and all |$l \neq k.$|

  • 7. Bootstrap within each trial, repeat steps 1–6.

  • 8. Use bootstrap inference from Step 1 to test for/investigate wide effect modification, as well as one-sided strong average causal sufficiency and average causal necessity.

  • 9. Use bootstrap inference from Step 6 to test for and quantify the predictive value of the candidate PIGS in a new setting and compare candidates.

The approximate absolute prediction error |$\widetilde{D}$| is also useful when making predictions. Once the above procedure has been used and a high quality PIGS found, one can use the following prediction procedure for estimation of the true CEP for a whole trial or a given individual in a new setting:

  • 1. Fit model for |${\rm{risk}}_{zJ}(s_{1J})$| using all trial data, and obtain an estimate of the CEP for a new subject |$i^{\star}$| for a new trial |$J+1.$|

  • 2. Bootstrap within each trial and repeat step 1 to obtain bootstrap confidence bands for the generated CEP curve for a new trial |$J+1$|⁠, or subject |$i^*$| from this setting.

  • 3. Add |$\widetilde{D}$| to the upper limit and subtract it from the lower limit of the bootstrap 95% confidence intervals for the CEP to obtain prediction intervals.

Although the confidence interval for |$\widetilde{D}$| should give boundaries for the expected error for CEP curve prediction versus the estimate one would have obtained if one had observed the within-trial estimate, this does not completely contain the error expected in estimating the CEP value within that trial. Therefore we use the two-part interval given above in Step 3 as a conservative prediction interval, both with expected error made and the uncertainty in the estimates about the truth. The |$\widetilde{D}$| interval will contain much of the estimated uncertainty, but not that of the full model, and as mentioned before, this approximation uses an estimated value as the truth.

6. Simulated examples

To evaluate the operating characteristics of our proposed procedure, we simulated and analysed data as follows. For each of |$J$| trials with sample sizes |$n_1, \ldots, n_J$|⁠, we generated individual-level data according to a Poisson model for risk. Treatments were randomized at a 1:1 ratio, and the candidate PIGS were sampled from a normal distribution. Specifically, for trial |$j$|⁠, we have |$Z_{j} \sim \mbox{Binom}(n_j, .5)$|⁠, |$W_{j} \sim N(0, \sigma^2_{wj})$|⁠, |$S_{ij}(0) = 0 $| and |$S_{ij}(1) = s_{1ij}= \left\{ \mu_{sj} + \gamma*w_{ij} + \varepsilon_i\right\}$|⁠, where |$\varepsilon_i \sim N(0, \sigma^2_{sj})$| and |$N(m, v)$| denotes a normally distributed random variable with mean |$m$| and variance |$v$|⁠. Then the data generation mechanisms within each trial are:

Then |$Y_{ij}(z) \sim \mbox{Poisson}\{\lambda_{zj}(s_{1ij})\}$| for |$z = 0, 1$|⁠. We consider five scenarios under a similar setting to the motivating example of rotavirus vaccine in 12 trials. The five scenarios we consider are: first, a high quality specific surrogate in half of the 12 trials that is useless in the other trial settings; second, a surrogate with wide effect modification in all trials, but opposite directions of association between risk and |$S(1)$| in the two halves of the trials; third, a surrogate with wide effect modification in all trials in the same positive direction so ACS holds, but all the trials have different levels of efficacy even after adjustment for baseline characteristics when |$s_1 = 0$|⁠; fourth, a surrogate with wide effect modification, one-sided strong average causal sufficiency, average causal necessity does not hold, but all trials have heterogeneous associations between outcome and S and there are different positive levels of efficacy when |$s_{1} = 0$|⁠. The fifth scenario is an example that fits all three criteria for a high quality PIGS. In it, there is a high quality surrogate and there are no trial effects on the outcome given the surrogate and baseline variables. Wide effect modification, one-sided strong average causal sufficiency and average causal necessity hold in all trials. A detailed specification of the scenarios given the models above is provided in the supplementary material available at Biostatistics online.

For each scenario we consider 12 trials of varying size, |$n_j \in [100, 1000]$| and with varying sampling of |$S$|⁠, with only the baseline variable augmentation. The variance parameters are chosen so that all baseline variables, |$W$|⁠, are at least 0.7 correlated with |$S(1)$|⁠, although the |$S(1)|W$| relationship may differ over the trials, as this has been shown to be needed for unbiased estimation in previous work (Gilbert and Hudgens, 2008; Huang and Gilbert, 2011; Gabriel and Gilbert, 2014). Although these scenarios are based on the motivating application, the simulations are not limiting because, as mentioned above, the |$D$| estimation methods are independent of the risk estimation. Any existing consistent marginal conditional risk estimation method could have been used.

Once generated, the data are analysed at the individual level using the R (R Core Team, 2016) package pseval (Sachs and Gabriel, 2016). Specifically, we use the pseval package to implement the EML to fit the Poisson risk models. We use semi-parametric EML estimation, first outlined in Huang and Gilbert (2011), to integrate over the unobserved likelihood contributions; to obtain bootstrap samples, we sample with replacement within each trial. Since here the CEP is linear in the coefficients, CEP confidence intervals are based on the linear combination of bootstrap standard errors of the estimated coefficients with a simple Bonferroni adjustment to ensure correct simultaneous coverage. All code for the simulations is in the supplementary material available at Biostatistics online. Our CEP of interest is: |$-\mbox{log}(RR)$| which is greater than zero when there is positive efficacy. As suggested above, the tests/investigation of wide effect modification, one-side strong average causal sufficiency and average causal necessity under this model and CEP are formulated here as bootstrap Wald tests and 95% CI for each of the 12 trials. As any one trial not rejecting, or not covering 0 in the case of average causal necessity, is evidence against the criteria, the 12 tests being performed will not increase the type 1 error rate. We give the type 1 error rate/power for these tests and the test based on |$\widetilde{D} - \widetilde{D}_0$| for each of the scenarios in Table 1.

Table 1.

Power for tests of surrogate value. Test for wide effect modification is a Wald test for the interaction coefficient based on bootstrapped standard errors. The CEP curves increasing in |$S(1)$| test is a one-sided version of the test for wide effect modification (wide effect modification). For average causal necessity, we report the average estimate of |$\widehat{CEP}(S(1)=0)$|⁠. The test for |$\widetilde{D}$| vs |$\widetilde{D}^0$| is the test described in Section 3. Simulation results based on 300 replicates, and 100 bootstraps each

 Wide effect modificationCEP increasing in |$s_1$||$\widehat{CEP}(S(1)=0)$|Test for |$\widetilde{D}$| vs |$\widetilde{D}^0$|
Scenario 10.000.000.100.00
Scenario 20.350.00–0.020.00
Scenario 30.230.410.260.17
Scenario 40.180.32–0.120.91
Scenario 50.830.820.001.00
 Wide effect modificationCEP increasing in |$s_1$||$\widehat{CEP}(S(1)=0)$|Test for |$\widetilde{D}$| vs |$\widetilde{D}^0$|
Scenario 10.000.000.100.00
Scenario 20.350.00–0.020.00
Scenario 30.230.410.260.17
Scenario 40.180.32–0.120.91
Scenario 50.830.820.001.00
Table 1.

Power for tests of surrogate value. Test for wide effect modification is a Wald test for the interaction coefficient based on bootstrapped standard errors. The CEP curves increasing in |$S(1)$| test is a one-sided version of the test for wide effect modification (wide effect modification). For average causal necessity, we report the average estimate of |$\widehat{CEP}(S(1)=0)$|⁠. The test for |$\widetilde{D}$| vs |$\widetilde{D}^0$| is the test described in Section 3. Simulation results based on 300 replicates, and 100 bootstraps each

 Wide effect modificationCEP increasing in |$s_1$||$\widehat{CEP}(S(1)=0)$|Test for |$\widetilde{D}$| vs |$\widetilde{D}^0$|
Scenario 10.000.000.100.00
Scenario 20.350.00–0.020.00
Scenario 30.230.410.260.17
Scenario 40.180.32–0.120.91
Scenario 50.830.820.001.00
 Wide effect modificationCEP increasing in |$s_1$||$\widehat{CEP}(S(1)=0)$|Test for |$\widetilde{D}$| vs |$\widetilde{D}^0$|
Scenario 10.000.000.100.00
Scenario 20.350.00–0.020.00
Scenario 30.230.410.260.17
Scenario 40.180.32–0.120.91
Scenario 50.830.820.001.00

Table 2.

Performance of |$\widetilde{D}$|⁠: bias and prediction interval coverage. Simulation results based on 300 replicates and 100 bootstraps each

 Average |$\widehat{D} - \widetilde{D}$|Prediction coverage mean |$\widetilde{D}$|Prediction coverage 97.5 % ile |$\widetilde{D}$|
Scenario 1|$-0.01$|0.620.99
Scenario 2|$-0.05$|0.040.75
Scenario 3|$-0.01$|0.981.00
Scenario 4|$-0.03$|1.001.00
Scenario 5|$-0.09$|1.001.00
 Average |$\widehat{D} - \widetilde{D}$|Prediction coverage mean |$\widetilde{D}$|Prediction coverage 97.5 % ile |$\widetilde{D}$|
Scenario 1|$-0.01$|0.620.99
Scenario 2|$-0.05$|0.040.75
Scenario 3|$-0.01$|0.981.00
Scenario 4|$-0.03$|1.001.00
Scenario 5|$-0.09$|1.001.00
Table 2.

Performance of |$\widetilde{D}$|⁠: bias and prediction interval coverage. Simulation results based on 300 replicates and 100 bootstraps each

 Average |$\widehat{D} - \widetilde{D}$|Prediction coverage mean |$\widetilde{D}$|Prediction coverage 97.5 % ile |$\widetilde{D}$|
Scenario 1|$-0.01$|0.620.99
Scenario 2|$-0.05$|0.040.75
Scenario 3|$-0.01$|0.981.00
Scenario 4|$-0.03$|1.001.00
Scenario 5|$-0.09$|1.001.00
 Average |$\widehat{D} - \widetilde{D}$|Prediction coverage mean |$\widetilde{D}$|Prediction coverage 97.5 % ile |$\widetilde{D}$|
Scenario 1|$-0.01$|0.620.99
Scenario 2|$-0.05$|0.040.75
Scenario 3|$-0.01$|0.981.00
Scenario 4|$-0.03$|1.001.00
Scenario 5|$-0.09$|1.001.00

For the out-of-sample prediction error, we will test the null |$\widetilde{D} >\widetilde{D}^{0}$|⁠. In each of the scenarios we will also estimate the CEP for a given new subject and for each of the these predictions we provide coverage of the two different prediction intervals; one using |$\widetilde{D}$| and one using the 97.5 percentile of the bootstrapped |$\widetilde{D}$|s. Figure 1 gives the true CEP for each of the trials for one of the simulations, the overall CEP estimate, |$\widehat{CEP}_{J}$|⁠, and both prediction intervals.

Results from one replicate of the simulation study for each of the scenarios. The solid black line shows the overall estimate, the gray lines show the true CEP curves for each trial in the scenario, the dotted black lines show the 95% prediction interval using the estimated $\widetilde{D}$, and the dashed black lines are 95% prediction intervals based on the upper limit of a bootstrap 95% confidence interval for $\widetilde{D}$.
Fig. 1.

Results from one replicate of the simulation study for each of the scenarios. The solid black line shows the overall estimate, the gray lines show the true CEP curves for each trial in the scenario, the dotted black lines show the 95% prediction interval using the estimated |$\widetilde{D}$|⁠, and the dashed black lines are 95% prediction intervals based on the upper limit of a bootstrap 95% confidence interval for |$\widetilde{D}$|⁠.

The direct test for wide effect modification has far less than 5% type I error rate in Scenario 1, where half of the trials do not have any wide effect modification. In Scenario 2, where half of the trials have wide effect modification going in opposite directions, the direct test for wide effect modification still rejects a substantial proportion of the time. In contrast, the test for |$\widetilde{D} < \widetilde{D}^0$| incorporates information about the heterogeneity in the set of trials. Thus the power for that test is nearly 0 in Scenarios 1 and 2, which have a large degree of heterogeneity in the relationship between clinical efficacy and |$S(1)$|⁠. In scenario 3, where all of the trials have wide effect modification, but there is some heterogeneity, the |$\widetilde{D}$| test has roughly the same power as the test for wide effect modification. The power for the |$\widetilde{D}$| tests is much higher in Scenarios 4 and 5, where there is little to no heterogeneity among the trials with respect to the relationship between |$S(1)$| and the CEP.

|$\widetilde{D}$| has reasonably low bias for the true |$\widehat{D}$|⁠, and is on average larger than |$\widehat{D}$| in all cases, making prediction intervals based on it conservative. Coverage is overly conservative when using the upper 95% confidence limit for |$\widetilde{D}$|⁠, and we therefore recommend the use of the point estimate |$\widetilde{D}$|⁠. Even using |$\widetilde{D}$|⁠, the prediction interval coverage is good when the candidate PIGS has out-of-sample predictive value; the prediction interval coverage is expectedly poor when the candidate has no value as a PIGS.

7. Application

Although deaths from complications of rotavirus infection declined from an estimated |$>$|500,000 in children under five years of age in 2000 to approximately 250000 deaths in 2013 (Tate and others, 2016), a globally effective vaccine is still needed. The pentavalent rotavirus vaccine RotaTeqTM (RV5) developed by Merck & Co., Inc., Kenilworth, NJ USA, has been licensed for use in over 120 countries. The Rotavirus Efficacy and Safety Trial (REST) (Vesikari and others, 2006) was conducted in 11 countries. In the sub-study of REST in Finland and the USA, estimated efficacy against severe RVGE was as high as 98%. In separate studies in regions of Africa and Asia, lower efficacy was observed ranging from 39% to 48% (Armah and others, 2010; Zaman and others, 2010). Lower efficacy may be related to differences in participants’ immune system function. These data were provided by Merck & Co., Inc., Kenilworth, NJ USA, through data sharing agreements with the Fred Hutchinson Cancer Research Center and the National Institute of Allergy and Infectious Diseases.

We investigate several candidate PIGS using data from four phase II and III studies of RV5 conducted in seven countries (Vesikari and others, 2006; Heaton and others, 2005; Armah and others, 2010, 2012; Zaman and others, 2010; Shin and others, 2012). Although it was not the primary outcome in the trials, here we investigate RVGE of any severity as the clinical outcome. Some studies took place over multiple countries. For the purpose of this analysis, we assume participants from different countries in the same study to be independent trial settings, for a total of 12 useful evaluation trial settings. Several candidate PIGS were measured. These included serum anti-rotavirus IgA B-cell responses (IgA), as well as serum neutralization antibody (SNA) to the human rotavirus serotypes G1, G2, G3, G4, and P1A. Immune responses were all measured after the third dose of vaccine in each study. Linear combinations of baseline variables were developed as the predictive augmentation to account for the missing candidate PIGS values in the placebo subjects; only a limited set of baseline covariates including age, gender and race/ethnicity were available.

Unfortunately after investigating the candidate univariate PIGS, there was no evidence of a high quality PIGS. This may be due in part to the weak association of the available baseline variables and the candidate surrogates, Spearman correlations ranged between 0.21 and 0.33. Some of the candidate PIGS do however show evidence of specific individual-level surrogacy in some region/trial settings. For example, there is evidence of wide effect modification for G3 SNA in the Finland portion of the REST trial (⁠|$P=0.036$|⁠). Evidence of this predictive relationship is not apparent in the rest of the evaluation settings making it unlikely to be a good PIGS.

We report on a single marker, the log titer value of SNA G4, to illustrate the methods. The marker log titer value of SNA G4 was first transformed by subtracting its baseline value, allowing for the possibility of inducing a constant value for |$S(0)$|⁠. The maximum correlation between the linear combinations of the available baseline predictive variables and the SNA G4 candidate surrogate was 0.22 over the trials.

Figure 2 shows the within trial estimated CEP curves for this candidate PIGS with the CEP of the negative of the log risk ratio. The overall curve is illustrated by the thick black line, with the within trial curves illustrated by thinner lines with identifying symbols. These curves do not show wide variation over |$S(1)$| and the overall model is on average flat around the unconditional empirical estimate of the negative log risk ratio estimate over all the trials, 0.72. The test for wide effect modification fails to reject.

Evaluating the candidate PIGS difference in log serum neutralizing antibodies G4 from baseline to after the third dose by trial and overall. The solid black line is the overall estimate of $-$log(risk$_1$/risk$_0$) (negative log risk ratio) and the gray ribbon indicates the prediction interval based on $\widetilde{D}$. The labeled curves are the estimates within each of the sub-regions/trials.
Fig. 2.

Evaluating the candidate PIGS difference in log serum neutralizing antibodies G4 from baseline to after the third dose by trial and overall. The solid black line is the overall estimate of |$-$|log(risk|$_1$|/risk|$_0$|⁠) (negative log risk ratio) and the gray ribbon indicates the prediction interval based on |$\widetilde{D}$|⁠. The labeled curves are the estimates within each of the sub-regions/trials.

Evaluating the candidate PIGS difference in log serum neutralizing antibodies G4 from baseline to after the third dose by trial and overall. The solid black line is the overall estimate of -log(risk$_1$ / risk$_0$) (negative log risk ratio). The labeled curves are the leave-one-out estimates holding out data from each of the indicated sub-regions/trials.
Fig. 3.

Evaluating the candidate PIGS difference in log serum neutralizing antibodies G4 from baseline to after the third dose by trial and overall. The solid black line is the overall estimate of -log(risk|$_1$| / risk|$_0$|⁠) (negative log risk ratio). The labeled curves are the leave-one-out estimates holding out data from each of the indicated sub-regions/trials.

The estimated |$\widetilde{D}$| was 0.59. This estimate was added to the overall confidence band limits to obtain prediction intervals, as shown by the gray ribbon in Figure 2. The difference between |$\widetilde{D}$| and |$\widetilde{D}^0$| was nearly 0, averaged over the trials, suggesting that there is little evidence that log titer value of SNA G4 adds predictive values to the model; P-value 0.53. As we cannot reject the nulls for either criterion 1 or 2, we do not formally investigate evidence of criterion 3. However, from the plots one can see that average causal necessity and one-sided strong average causal sufficiency would likely not hold because the estimated CEPs are above 0 for all observed values of |$S(1)$| over all trials and some of the estimated trial specific CEP curves are decreasing in |$S(1)$|⁠.

8. Discussion

We introduce the definition for a PIGS as a surrogate that is predictive of individual-level efficacy in a new trial setting. We outline two criteria for a biomarker to have any value as a PIGS, with an additional criterion for an ideal PIGS. We demonstrate that in settings where some trial augmentation has been performed, not necessarily the same augmentation in each trial, the CEP-based evaluation of the suggested criteria allow for the evaluation and comparison of candidate PIGS based on both within-sample and out-of-sample predictive accuracy.

We provide a means to quantify the prediction error one would expect when using a PIGS to predict efficacy in a new setting, using a cross-validation estimate of the absolute error. Based on this error estimate, we develop prediction intervals for individual-level clinical efficacy predictions in a new trial setting. We show that for a validated PIGS, predictions for a new individuals or trial settings are relatively accurate and that our suggested prediction intervals have good coverage in scenarios where the candidate PIGS has predictive value.

It is important to note that evaluation of our proposed criteria only provides evidence that the candidate surrogate has predictive value outside the trials used to fit the current risk model and that it is a specific surrogate in all evaluation settings. The new settings in which the surrogate is useful as a predictor are still limited by the trials used for evaluation; if one uses a set of very similar trials, a candidate surrogate for which criteria 1–3 hold may still not be useful in dissimilar settings. This concept is not specific to PIGS or surrogacy, as even trial-level treatment effects should not be considered generalizable to differing settings without evidence that supports their transportability. The trial settings of our motivating application differ in many ways including region, ethnic group and vaccine dosage. Therefore, any findings from the application are plausibly applicable to a wide range of settings.

Although not the purpose of this work, one could view our method as way of testing the homogeneity assumption, as stated in Jiang and others (2016), that the clinical outcome is independent of the trial indicator |$j$| conditional on the candidate surrogate and individual-level baseline variables. We believe this assumption is often testable, conditional on sample size, provided trial augmentation is available via our proposed method. If there are trial-level effects that violate homogeneity, then they are either small enough that |$S(1)$| is still useful as a predictor in the CEP model or the relationship between the CEP and |$S(1)$| is different enough that |$S$| is no longer a useful predictor in a given trial based on the other trials’ risk models. Therefore, criterion 2 can be viewed as a weak form of homogeneity, that gives a clear testable hypothesis.

Limitations of this work include the need for augmentation, the lack of a closed form variance when EML estimation is used, and the computational intensity of the procedure. However, with increasing computing speed, bootstrapping a |$3*J$| modeling procedure, for some small fixed |$J$|⁠, seems increasingly trivial. As well, the conservative coverage of the prediction intervals suggests that improvements can be made with further research. An avenue for future research is to explore the possibility of improving coverage by investigating other types of bootstrap confidence intervals.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Acknowledgments

We are grateful to Merck & Co. Inc. (Kenilworth, NJ) for sharing their RotaTeqTM clinical trial data with us. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). Conflict of Interest: None declared.

Funding

The authors received no funding from Merck & Co. Inc. for this analysis nor did Merck & Co. Inc. influence the analysis in any way. National Institutes of Health (R37 AI032042 to M.E.H.).

References

Armah
G. E.
,
Breiman
R. F.
,
Tapia
M. D.
,
Dallas
M. J.
,
Neuzil
K. M.
,
Binka
F. N.
,
Sow
S. O.
,
Ojwando
J.
,
Ciarlet
M.
and
Steele
A. D.
(
2012
).
Immunogenicity of the pentavalent rotavirus vaccine in African infants.
Vaccine
30
,
A86
A93
.

Armah
G. E.
,
Sow
S. O.
,
Breiman
R. F.
,
Dallas
M. J.
,
Tapia
M. D.
,
Feikin
D. R.
,
Binka
F. N.
,
Steele
A. D.
,
Laserson
K. F.
,
Ansah
N. A.
and others (
2010
).
Efficacy of pentavalent rotavirus vaccine against severe rotavirus gastroenteritis in infants in developing countries in Sub-Saharan Africa: a randomized, double-blind, placebo-controlled trial.
The Lancet
376
,
606
614
.

Burzykowski
T.
,
Molenberghs
G.
,
Buyse
M.
,
Geys
H.
and
Renard
D.
(
2001
).
Validation of surrogate end points in multiple randomized clinical trials with failure time end points.
Journal of the Royal Statistical Society: Series C (Applied Statistics)
50
,
405
422
.

Buyse
M.
,
Molenberghs
G.
,
Burzykowski
T.
,
Renard
D.
and
Geys
H.
(
2000
).
The validation of surrogate endpoints in meta-analyses of randomized experiments.
Biostatistics
1
,
49
67
.

Follmann
D.
(
2006
).
Augmented designs to assess immune response in vaccine trials.
Biometrics
62
,
1161
1169
.

Frangakis
C. E.
and
Rubin
D. B.
(
2002
).
Principal stratification in causal inference.
Biometrics
58
,
21
29
.

Gabriel
E. E.
,
Daniels
M. J.
and
Halloran
M. E.
(
2016
).
Comparing biomarkers as trial level general surrogates.
Biometrics
72
,
1046
1054
.

Gabriel
E. E.
and
Follmann
D.
(
2016
).
Augmented trial designs for evaluation of principal surrogates.
Biostatistics
17
,
453
467
.

Gabriel
E. E.
and
Gilbert
P. B.
(
2014
).
Evaluating principal surrogate endpoints with time-to-event data accounting for time-varying treatment efficacy.
Biostatistics
15
,
251
265
.

Gabriel
E. E.
,
Sachs
M. C.
and
Gilbert
P. B.
(
2015
).
Comparing and combining biomarkers as principal surrogates for time-to-event clinical endpoints.
Statistics in Medicine
34
,
381
395
.

Gilbert
P. B.
and
Hudgens
M. G.
(
2008
).
Evaluating candidate principal surrogate endpoints.
Biometrics
64
,
1146
1154
.

Gilbert
P. B.
,
Hudgens
M. G.
and
Wolfson
J.
(
2011
).
Commentary on “Principal stratification—a goal or a tool?” by Judea Pearl.
The International Journal of Biostatistics
7
,
1
15
.

Gilbert
P. B.
,
Qin
L.
and
Self
S. G.
(
2008
).
Evaluating a surrogate endpoint at three levels, with application to vaccine development.
Statistics in Medicine
27
,
4758
4778
.

Heaton
P. M.
,
Goveia
M. G.
,
Miller
J. M.
,
Offit
P.
and
Clark
H. F.
(
2005
).
Development of a pentavalent rotavirus vaccine against prevalent serotypes of rotavirus gastroenteritis.
Journal of Infectious Diseases
192
,
S17
S21
.

Huang
Y.
and
Gilbert
P. B.
(
2011
).
Comparing biomarkers as principal surrogate endpoints.
Biometrics
67
,
1442
1451
.

Huang
Y.
,
Gilbert
P. B.
and
Wolfson
J.
(
2013
).
Design and estimation for evaluating principal surrogate markers in vaccine trials.
Biometrics
69
,
301
309
.

Jiang
Z.
,
Ding
P.
and
Geng
Z.
(
2016
).
Principal causal effect identification and surrogate endpoint evaluation by multiple trials.
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
78
,
829
848
.

Joffe
M. M.
and
Greene
T.
(
2009
).
Related causal frameworks for surrogate outcomes.
Biometrics
65
,
530
538
.

Li
Y.
,
Taylor
J. M. G.
and
Elliott
M. R.
(
2010
).
A Bayesian approach to surrogacy assessment using principal stratification in clinical trials.
Biometrics
66
,
523
531
.

Li
Y.
,
Taylor
J. M. G.
,
Elliott
M. R.
and
Sargent
D. J.
(
2011
).
Causal assessment of surrogacy in a meta-analysis of colorectal cancer trials.
Biostatistics
12
,
478
492
.

Pepe
M. S.
and
Fleming
T. R.
(
1991
).
A nonparametric method for dealing with mismeasured covariate data.
Journal of the American Statistical Association
86
,
108
113
.

Prentice
R. L.
(
1989
).
Surrogate endpoints in clinical trials: definition and operational criteria.
Statistics in Medicine
8
,
431
440
.

Qin
L.
,
Gilbert
P. B.
,
Corey
L.
,
McElrath
M. J.
and
Self
S. G.
(
2007
).
A framework for assessing immunological correlates of protection in vaccine trials.
The Journal of Infectious Diseases
196
,
1304
1312
.

Qin
L.
,
Gilbert
P. B.
,
Follmann
D.
and
Li
D.
(
2008
).
Assessing surrogate endpoints in vaccine trials with case-cohort sampling and the Cox model.
Annals of Applied Statistics
2
,
386
407
.

R Core Team
(
2016
).
R: A Language and Environment for Statistical Computing
.
R Foundation for Statistical Computing,
Vienna, Austria.
https://www.R-project.org/.

Robins
J. M.
and
Greenland
S.
(
1992
).
Identifiability and exchangeability for direct and indirect effects.
Epidemiology
3
,
143
155
.

Sachs
M. C.
and
Gabriel
E. E.
(
2016
).
An introduction to principal surrogate evaluation with the pseval package.
The R Journal
8
,
277
292
.

Shin
S.
,
Anh
D. D.
,
Zaman
K.
,
Yunus
M.
,
Mai Le
T. P.
,
Thiem
V. D.
,
Azim
T.
,
Victor
J. C.
,
Dallas
M. J.
,
Steele
A. D.
and others (
2012
).
Immunogenicity of the pentavalent rotavirus vaccine among infants in two developing countries in Asia: Bangladesh and Vietnam.
Vaccine
30
,
A106
A113
.

Tate
J. E.
,
Burton
A. H.
,
Boschi-Pinto
C.
and
Parashar
U. D.
(
2016
).
Global, regional, and national estimates of Rotavirus mortality in children under 5 years of age,
2000
2013
.
Clinical Infectious Diseases
62
,
S96
S105
.

Vesikari
T.
,
Matson
D. O.
,
Dennehy
P.
,
Van Damme
P.
,
Santosham
M.
,
Rodriguez
Z.
,
Dallas
M. J.
,
Heyse
J. F.
,
Goveia
M. G.
,
Black
S. B.
and others (
2006
).
Safety and efficacy of a pentavalent human-bovine (WC3) reassortant rotavirus vaccine.
New England Journal of Medicine
354
,
23
33
.

Wolfson
J.
and
Gilbert
P. B.
(
2010
).
Statistical identifiability and the surrogate endpoint problem, with application to vaccine trials.
Biometrics
66
,
1153
1161
.

Zaman
K.
,
Anh
D. D.
,
Victor
J. C.
,
Shin
S.
,
Yunus
M. D.
,
Dallas
M. J.
,
Podder
G.
,
Thiem
V. D.
,
Mai Le
T. P.
,
Luby
S. P.
and others (
2010
).
Efficacy of pentavalent rotavirus vaccine against severe rotavirus gastroenteritis in infants in developing countries in Asia: a randomised, double-blind, placebo-controlled trial.
The Lancet
376
,
615
623
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Supplementary data