Simultaneous confidence intervals that are compatible with closed testing in adaptive designs

Summary We describe a general method for finding a confidence region for a parameter vector that is compatible with the decisions of a two-stage closed test procedure in an adaptive experiment. The closed test procedure is characterized by the fact that rejection or nonrejection of a null hypothesis may depend on the decisions for other hypotheses and the compatible confidence region will, in general, have a complex, nonrectangular shape. We find the smallest cross-product of simultaneous confidence intervals containing the region and provide computational shortcuts for calculating the lower bounds on parameters corresponding to the rejected null hypotheses. We illustrate the method with an adaptive phase II/III clinical trial.


Introduction
For experiments designed to make inference about a parameter vector θ = (θ 1 , … , θ K ), it is common to find confidence intervals for all of the individual θ k such that the simultaneous coverage probability is at least 1 − α. Sometimes, though, an experimenter will only attempt to assert that an individual parameter exceeds a specific value, say θ k > δ k . If this cannot be achieved in such a way that the probability of making at least one incorrect rejection in a family of hypotheses H k = {θ k ⩽ δ k } (k = 1, … , K) is no greater than α, the experimenter will not assert anything about θ k . The latter method of inference is used in so-called closed test procedures (Marcus et al., 1976), and its advantage is often greater power.
For experiments conducted in a single stage, Hayter & Hsu (1994) showed how simultaneous 100(1 − α)% confidence intervals can be constructed to be compatible with some commonly used closed test procedures, in the sense that a null hypothesis H k is rejected at familywise level α if and only if the confidence interval for θ k excludes all values for which H k is true. Often, these intervals are scarcely more informative than the test decisions. For example, for one-sided problems where larger parameter values are more beneficial, no 100(1 − α)% lower confidence bound for any individual θ k can exceed δ k unless all hypotheses H 1 , … , H K can be rejected at familywise level α.
In this article we derive confidence intervals for adaptive experiments. Our motivating example is a seamless phase II/III clinical trial, although the method is not limited to this setting. Such trials consist of a first stage in which K experimental treatments, indexed by T 1 = {1, … , K}, are compared with a common control and, after an interim analysis, a second stage in which only a subset of treatments, indexed by T 2 ⊆ T 1 , are compared with the control. The state-of-the-art methodology for this problem (Bauer & Kieser, 1999; Posch et al., 2005; Bretz et al., 2009) is a hybrid of the closure principle of Marcus et al. (1976) and a p-value combination which goes back to Fisher (1932). This methodology allows any subset of treatments to be chosen at interim, based on all trial data and external factors. Other adaptations, such as sample size re-estimation, are also possible. A serious concern, though, is that there is no established method for constructing confidence intervals. As emphasized in the International Conference on Harmonisation's E9 guideline (ICH E9 Expert Working Group, 1999, p. 1932), 'Estimates of treatment effect should be accompanied by confidence intervals, whenever possible, and the way in which these will be calculated should be identified.' Posch et al. (2005) proposed 100(1 − α)% simultaneous confidence intervals following such a trial. Unfortunately, their intervals are not guaranteed to be compatible with the closed test procedure. Here, we construct intervals that are compatible. As in the one-stage case, an inevitable shortcoming of these intervals is that they are not always substantially more informative than the original test decisions. We will show that this problem is mitigated to some extent by the adaptive nature of the experiment.

2·1. Closure principle
The closure principle of Marcus et al. (1976) is a general method for multiple hypothesis testing. A formal description is given in Finner & Strassburger (2002), and we adopt similar notation here. Let $\{P_\theta : \theta \in \Theta\}$ be a family of probability measures defined on a common sample space $(\Omega, \mathcal{A})$, where Θ is a multi-dimensional parameter space.

Europe PMC Funders Author Manuscripts
where is the index set of true hypotheses under θ*. In other words, the probability of rejecting at least one true null hypothesis is bounded by α. This is known as strong control of the familywise error rate. The closure principle can be used to ensure (1). We are required to find, for each index set I such that the intersection hypothesis $H_I = \bigcap_{i \in I} H_i$ is nonempty, a local level-α test φ I for H I ; that is, we require
$\sup_{\theta \in H_I} P_\theta(\varphi_I = 1) \leqslant \alpha$, (2)
where φ I takes values in {0, 1} with the usual interpretation. If we define $\psi_i = \min\{\varphi_I : i \in I\}$, then (1) holds. This can be very useful, as in many applications it is easy to find tests satisfying (2), whereas validating (1) directly is hard.

2·2. Combination test
Fisher (1932) discussed combining independent p-values to test a single null hypothesis. For convenience and brevity, we will only consider two-stage designs. We define a p-value combination function $Q: [0, 1]^2 \to [0, 1]$ that is left-continuous and nondecreasing in both its arguments and is uniformly distributed provided that both arguments are themselves independent and uniformly distributed. An example is (3) where Φ denotes the standard normal distribution function.
Such a combination function lends itself to a two-stage adaptive closed test, ψ, for a family of null hypotheses. An important application, discussed in Bretz et al. (2009), is a seamless phase II/III confirmatory clinical trial. We henceforth restrict attention to a parameter θ = (θ 1 , … , θ K ) taking values in the parameter space Θ and a family of null hypotheses {H k : k ∈ T 1 }, where T 1 = {1, … , K} and H k = {θ k ⩽ δ k } (k ∈ T 1 ) for some constants δ k . The θ k (k ∈ T 1 ) might correspond to the mean effects of K different treatments, for example. By defining local tests φ I (I ⊆ T 1 ) via a combination function Q, it is possible to make data-dependent modifications to the trial design at an interim analysis (cf. Bauer & Kieser, 1999; Hommel, 2001; Brannath et al., 2002). For instance, attention can be focused on a subset T 2 ⊆ T 1 of the initial hypotheses of interest, and changes can be made to sample sizes, allocation ratios, etc.
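A combination function with the properties listed above can be sketched as follows. The weighted inverse-normal form, the equal default weights and the boundary conventions at 0 and 1 are assumptions made for illustration; whether this coincides exactly with (3) is not confirmed by the text.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, Phi


def inverse_normal_combination(p, q, w1=sqrt(0.5), w2=sqrt(0.5)):
    """Combine two stage-wise p-values (assumed inverse-normal form).

    Computes 1 - Phi(w1 * Phi^{-1}(1 - p) + w2 * Phi^{-1}(1 - q)),
    with weights assumed to satisfy w1**2 + w2**2 == 1.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("p-values must lie in [0, 1]")
    if abs(w1 * w1 + w2 * w2 - 1.0) > 1e-9:
        raise ValueError("weights must satisfy w1^2 + w2^2 = 1")
    # Boundary conventions (assumed): a p-value of 1 forces the result to 1,
    # otherwise a p-value of 0 forces the result to 0.
    if p == 1.0 or q == 1.0:
        return 1.0
    if p == 0.0 or q == 0.0:
        return 0.0
    # By symmetry of the normal law, Phi^{-1}(1 - p) = -Phi^{-1}(p), so the
    # combined p-value equals Phi(w1 * Phi^{-1}(p) + w2 * Phi^{-1}(q)).
    z = w1 * _STD_NORMAL.inv_cdf(p) + w2 * _STD_NORMAL.inv_cdf(q)
    return _STD_NORMAL.cdf(z)
```

With independent uniform arguments the weighted sum of normal scores is again standard normal, which is why the result is uniform; monotonicity in both arguments follows from the monotonicity of Φ and its inverse.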

2·3. Two-stage closed test procedure
Assume that the full first-stage trial data are represented by a random vector X with distribution function G(x; θ). Prior to starting the trial, one must specify a combination function Q and, for each I ⊆ T 1 , a first-stage test of H I with an associated p-value function that satisfies for all . The second-stage design is unspecified.
At the interim analysis, the experimenter defines a second-stage design, d, by choosing a subset of the original hypotheses, indexed by T 2 ⊆ T 1 , to continue studying in the second stage, along with second-stage sample sizes and, for each I ⊆ T 1 , a second-stage hypothesis test for H I . See below for a proposal for choosing second-stage tests for H I where I ⊈ T 2 . We assume that the design d is allowed to depend on the unblinded first-stage data x without prespecifying an adaptation rule. Let Y denote the data collected at the second stage, taking values in , and let (I ⊆ T 1 ) denote the p-value functions of the second-stage tests. Because the tests used in the second stage depend on the first-stage data x and the chosen design d, the p-value functions will in general depend on both.
Let F x,d (y; θ) denote the distribution function of the second-stage data, given the chosen design d and interim data x. We assume that for all x, d and I ⊆ T 1 , the second-stage p-values satisfy for all u ∈ [0, 1]. The distribution F x,d is assumed to be known, i.e., not merely specified up to a null set, for all x and d, a condition that can be formalized by assuming an appropriate regression model (Brannath et al., 2012). See § 3·2 for a numerical example.
At the final analysis, for each I ⊆ T 1 , the test decision is φ I = 1 if and only if . As shown in Brannath et al. (2012), this combination test for H I controls the Type I error rate at level α.
We assume that only data for the hypotheses indexed by T 2 are collected in the second stage and propose setting for I ⊈ T 2 , where we drop the indices x and d for simplicity and set by convention. Such second-stage p-values have the required distribution under H I∩T 2 and hence also under H I .
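The two-stage closed test described in this subsection can be sketched in a few lines. Everything specific here is an assumption for illustration: Bonferroni local tests at both stages, an equally weighted inverse-normal combination function, rejection of H I when the combined p-value is at most α, and a second-stage p-value of 1 when I ∩ T 2 is empty. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

_PHI = NormalDist()


def combine(p, q):
    """Equally weighted inverse-normal combination (an assumed choice of Q)."""
    if p == 1.0 or q == 1.0:
        return 1.0
    if p == 0.0 or q == 0.0:
        return 0.0
    return _PHI.cdf((_PHI.inv_cdf(p) + _PHI.inv_cdf(q)) / sqrt(2.0))


def bonferroni(pvals, index_set):
    """Bonferroni p-value for the intersection hypothesis over index_set."""
    return min(1.0, len(index_set) * min(pvals[k] for k in index_set))


def two_stage_closed_test(p1, p2, alpha):
    """Return {k: psi_k} for all k in T1.

    p1: first-stage elementary p-values, keys form T1.
    p2: second-stage elementary p-values, keys form T2 (a subset of T1).
    """
    t1 = sorted(p1)
    local = {}
    for size in range(1, len(t1) + 1):
        for subset in combinations(t1, size):
            carried = [k for k in subset if k in p2]  # I intersected with T2
            # Second-stage p-value for H_I is that of H_{I and T2};
            # it is set to 1 when no hypothesis in I was carried forward.
            q = bonferroni(p2, carried) if carried else 1.0
            local[subset] = combine(bonferroni(p1, subset), q) <= alpha
    # Closure: H_k is rejected only if every H_I with k in I is rejected.
    return {k: all(rej for s, rej in local.items() if k in s) for k in t1}
```

For instance, with first-stage p-values 0·2 and 0·01 for treatments A and B, only B continued with second-stage p-value 0·001, and α = 0·025, the sketch rejects H B but not H A , since the local test of H A alone receives a second-stage p-value of 1.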
We emphasize that while Type I error control is guaranteed even if the second-stage design is initially open-ended, in the design of actual clinical trials it is crucial to perform detailed planning based on likely first-stage outcomes. The added flexibility is necessary because it is impossible to foresee all eventualities in extremely complex areas such as clinical drug development.

3·1. Partitioning the parameter space
A standard approach to deriving a 100(1 − α)% confidence set for θ is to perform a level-α test of each elementary hypothesis {θ = θ*} (θ* ∈ Θ) and include all θ* corresponding to nonrejected hypotheses (see, e.g., Lehmann, 1986, p. 90). To ensure compatibility with closed testing, the key idea (Stefansson et al., 1988; Hayter & Hsu, 1994; Finner & Strassburger, 2002) is to partition the parameter space into disjoint regions is constant in all arguments such that i ∉ I ∩ T j , and is left-continuous and nondecreasing in all arguments such that i ∈ I ∩ T j , with for any θ* such that for all i ∈ I ∩ T j . Furthermore, we assume that (6)
PROPOSITION 1. Inserted into (5), the following families of hypothesis tests give rise to a 100(1 − α)% confidence set for θ, denoted by C, that is compatible with the two-stage closed test procedure, i.e., ψ k = 1 if and only if H k ∩ C = ∅: for ∅ ≠ I ⊆ T 1 and θ* ∈ Θ, and {φ ∅ (θ*): θ* ∈ Θ} is any family of tests satisfying (4).
Proof. See the Appendix.
There will be no unique collection of families of p-values satisfying the aforementioned distributional and monotonicity constraints. Rather, the families must be specified in a two-stage procedure in an analogous way to the p-values in § 2·3. As will become clear from the example below, for many commonly encountered scenarios and when I ∩ T j ≠ ∅, the choice of will be obvious from the choice of . As a simple example, suppose that is the p-value from a one-sided z-test of the null hypothesis {θ k ⩽ δ k } using the stage-j data only. Then the natural choice for is the one-sided p-value from a standard z-test of using the same stage-j data.
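The z-test example can be made concrete with a small sketch. It assumes a normally distributed stage-wise estimate with known standard error; the function name and arguments are hypothetical. Shifting the null value from δ k to an arbitrary value gives a p-value that is nondecreasing in that value, which is the monotonicity required of the augmented families.

```python
from statistics import NormalDist

_PHI = NormalDist()


def shifted_z_p_value(estimate, std_error, theta_star):
    """One-sided z-test p-value for the null {theta_k <= theta_star}.

    estimate:   stage-wise estimate of theta_k (assumed normal)
    std_error:  its known standard error (assumed)
    theta_star: null boundary; theta_star = delta_k gives the original test
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    z = (estimate - theta_star) / std_error
    return _PHI.cdf(-z)  # equals 1 - Phi(z)
```

Evaluating the function at the original boundary δ k reproduces the p-value of the test of H k , and evaluating it at larger boundary values yields larger p-values.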
While for I ∩ T j ≠ ∅ there will often be a natural choice for , it is unclear how φ ∅ (θ*) and should be chosen. A reasonable suggestion is given below.
COROLLARY 1. Define for j = 1, 2. The following is a 100(1 − α)% confidence region for θ that is compatible with the two-stage closed test procedure:

3·2. Example
The properties of a region defined by (8) are best illustrated by a specific example. Based on this assumption, Simes (1986) (9) where is defined as for all θ* ∈ Θ.
The region (9) will have a complicated three-dimensional shape. However, in terms of making inference on θ B , its crucial features can be seen by taking two cross-sections, as displayed in Fig. 1. As is nondecreasing in for all I ⊆ T 1 , we know that for any γ ∈ (-∞, 0), the cross-section at is contained in the cross-section at . Similarly, for any γ ∈ (0, ∞), the cross-section at is contained in the limit of the cross-section of the region as . One can see immediately from Fig. 1 that for any ϵ > 0, the 97·5% confidence region fails to exclude all parameter vectors θ* such that . In other words, the lower confidence bound on θ B provides no more information than the decision of the closed test procedure.
For confidence intervals that are compatible with single-stage closed test procedures (Hayter & Hsu, 1994; Strassburger & Bretz, 2008; Guilbaud, 2008), a necessary condition for obtaining informative lower confidence bounds for parameters corresponding to the rejected null hypotheses is that ψ k = 1 for all k ∈ T 1 . In the adaptive setting, this is no longer a necessary condition. For example, repeating the above test procedure at level α = 0·05, the compatible 95% confidence region analogous to (9) is also summarized in Fig. 1. Here it appears, and indeed can be verified by considering all values of , that there does exist some ϵ > 0 such that the confidence region excludes all parameter vectors θ* for which . We will show that for the two-stage adaptive setting, a necessary condition for informative lower confidence bounds on parameters corresponding to the rejected null hypotheses is that ψ k = 1 for all k ∈ T 2 . However, as can be seen from Fig. 1, this condition is not sufficient.

3·3. A two-stage, single-step confidence region
Posch et al. (2005) proposed the following 100(1 − α)% confidence region: (10)
They note that the resulting confidence intervals are not compatible with the closed test procedure described in § 2·3 (Posch et al., 2005, p. 3702). Nevertheless, the region (10) can be used to generate an alternative multiple test. More generally, any 1 − α confidence set C generates a multiple test for a family of hypotheses, whereby H k is rejected if and only if H k ∩ C = ∅. This guarantees strong control of the familywise error rate (1). The multiple test generated by (10) can be thought of as single-step in the sense that rejection or nonrejection of a null hypothesis does not take into account the decision for any other hypothesis. If H k is rejected, informative lower bounds will be available for θ k regardless of the test decisions for all other hypotheses.

Computation of confidence intervals

4·1. Least-favourable parameter configurations
In the above example, marginal inference on θ B was achieved by considering least-favourable parameter configurations for θ k , k ∈ T 1 \ {B}. This idea can be generalized to find 100(1 − α)% simultaneous confidence intervals containing (8) or (10).
DEFINITION 1. For j = 1, 2, k ∈ T 1 and I ⊆ T j , the locally least-favourable jth-stage p-value function for H k in Θ I , , is defined for I ≠ ∅ as , where ξ = (ξ 1 , … , ξ K ) with ξ i = δ i for i ≠ k and ξ k = ϑ. Additionally, for j = 1, 2,
PROPOSITION 2. The smallest Cartesian product of intervals, × k∈T 1 (l k , ∞), that contains the confidence region (8) has l k = min I⊆T 1 l k,I , where for k ∈ I, and for k ∉ I, (13)
Furthermore, these intervals are compatible with the two-stage closed test procedure, i.e., ψ k = 1 if and only if H k ∩ × k∈T 1 (l k , ∞) = ∅.
Proof. See the Appendix.
In general, finding each interval requires one-dimensional root finding for each I ⊆ T 1 , a calculation that is $O(2^K)$. However, substantial shortcuts are available for reducing the computational burden.
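The one-dimensional root finding mentioned above can be illustrated for a single parameter. This is a simplified sketch, not the bounds (12) and (13): it assumes a single hypothesis, stage-wise one-sided z-tests with known standard errors, and an equally weighted inverse-normal combination function. The lower bound is the value ϑ at which the combined p-value for the shifted null {θ ⩽ ϑ} equals α.

```python
from math import sqrt
from statistics import NormalDist

_PHI = NormalDist()


def combined_p_value(theta, est1, se1, est2, se2):
    """Inverse-normal combination of the two shifted z-test p-values."""
    z = ((est1 - theta) / se1 + (est2 - theta) / se2) / sqrt(2.0)
    return _PHI.cdf(-z)  # increasing in theta


def lower_confidence_bound(est1, se1, est2, se2, alpha, tol=1e-10):
    """Bisection for the theta solving combined_p_value(theta) = alpha."""
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must lie in (0, 0.5)")
    # Upper bracket: at max(est1, est2) both z-scores are <= 0, so the
    # combined p-value is at least 0.5 and therefore exceeds alpha.
    hi = max(est1, est2)
    # Lower bracket: step downwards until the combined p-value is <= alpha.
    lo, step = min(est1, est2), max(se1, se2)
    while combined_p_value(lo, est1, se1, est2, se2) > alpha:
        lo -= step
        step *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if combined_p_value(mid, est1, se1, est2, se2) > alpha:
            hi = mid  # mid is not excluded, so the bound lies below it
        else:
            lo = mid
    return lo
```

In this special case the root is also available in closed form, which provides a check on the bisection; in the general setting each l k,I would need its own search of this kind, which is the source of the exponential cost.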

4·2. Efficient computation of confidence bounds
There are two possible scenarios at the end of the closed test procedure: either ψ k = 1 for all k ∈ T 2 , or at least one H k (k ∈ T 2 ) fails to be rejected. In the latter case, there exists some I ⊆ T 1 with I ∩ T 2 ≠ ∅ such that for any k ∈ T 2 , and therefore l k ⩽ l k,I ⩽ δ k . Due to the compatibility of the intervals with the closed test procedure, if ψ k = 1, then l k = δ k ; if ψ k = 0, then l k < δ k .
If ψ k = 1 for all k ∈ T 2 , then l k ⩾ δ k for all k ∈ T 2 . Additionally, for all k ∈ T 2 and I ⊆ T 1 with I ∩ T 2 ≠ ∅, we know from (12) and (13) that l k,I = ∞; so, when finding l k = min I⊆T 1 l k,I in Proposition 2, the minimum can be taken over a much smaller number of l k,I . The following algorithm finds the lower bounds for all parameters corresponding to the rejected hypotheses.
Step 1. Perform the closed test procedure. If ψ k′ = 0 for some k′ ∈ T 2 , then l k = δ k for ψ k = 1 and l k < δ k for ψ k = 0. If ψ k = 1 for all k ∈ T 2 , go to Step 2.
Step 3. For k ∈ T 2 ,
The cost of computing the intervals for θ k (k ∈ T 2 ) in Step 3 is linear in the number of parameters.
Step 2 is $O(2^{|T_1 \setminus T_2|})$, but a shortcut of size |T 1 \ T 2 | is available, provided there exists an ordering i 1 , … , i k of T 1 \ T 2 such that for each u ∈ {1, … , k}, for all J ⊆ L ⊆ {i u , … , i k } with i u ∈ J. This is because we only have to check for u = 1, … , k. Many common multiple test procedures, such as those based on Dunnett (1955) tests or weighted Bonferroni tests, satisfy this condition, with the ordering i 1 , … , i k following the ordering of the univariate test statistics or the weighted elementary p-values (Brannath & Bretz, 2010).

4·3. Lower bounds for parameters corresponding to retained hypotheses
Consider k ∈ T 2 such that ψ k = 0. We know that l k < δ k , and therefore we need only consider l k,I such that k ∈ I. However, since in general l k,I < ∞, finding the minimum such lower bound will still have a computational cost that is exponential in the number of parameters.
For k ∈ I ⊆ T 1 \ T 2 , we have and know from (11) and (6) that this is equal to 1. Many commonly used combination functions, including (3), have the property that v = 1 implies . In this case, l k = −∞ for all k ∈ T 1 \ T 2 .

4·4. Lower bounds for the two-stage single-step procedure
Posch et al. (2005) showed that the region (10) is contained in a rectangle, , where (14) The computation of each interval requires only a one-dimensional search for a root, and overall computation will be linear in the number of parameters.

4·5. Example continued
Recall from § 3·2 that T 2 = {B} and ψ B = 1. Proceeding to Step 2 of the above algorithm, p M = 0·419. In this case we need just one iteration in Step 3, because and therefore the 97·5% confidence interval for θ B is (0, ∞), consistent with Fig. 1. This example emphasizes that there is a price to pay for the additional power of the closed test as opposed to the single-step procedure of § 3·3 with, by (14), While this agrees with the assertion θ B > 0 in this specific case, it is invalid to claim it as a 97·5% lower confidence bound if the closed test procedure of § 2·3 had been planned. One can see that for any α > 0·036, the 100(1 − α)% confidence interval for treatment B that is compatible with the closed test procedure has a positive lower bound. For example, the 95% lower confidence bound is l B = 0·0112, consistent with Fig. 1.

Confidence bounds for closed tests based on the conditional error rate
Consider again the two-stage closed test procedure of § 2·3. As an alternative to combination tests, Koenig et al. (2008) used the conditional error approach (Proschan & Hunsberger, 1995) to derive local tests φ I (I ⊆ T 1 ). The only difference is that instead of prespecifying a combination function Q and first-stage p-value , one must prespecify a measurable conditional error function such that and, at the final analysis, φ I =1 if and only if .
To produce a compatible 100(1 − α)% confidence region for θ, each A I (I ⊆ T 1 ) must be augmented with a family of conditional error functions {A I (θ*) : θ* ∈ Θ} such that and, for fixed , A I (θ*) is constant in all arguments with i ∉ I and is left-continuous and nonincreasing in all arguments with i ∈ I. Furthermore, A I (θ*)= A I for all θ* ∈ Θ such that for i ∈ I. The second-stage p-values must be augmented with a family as described in § 3·1. Müller & Schäfer (2004) propose defining A I = sup θ*∈H I E θ* (ϕ I | X), where ϕ I is a preplanned fixed sample level-α test for H I . In many situations the natural choice for A I (θ*) will be obvious from A I . For example, if ϕ I is the decision function for a Dunnett (1955) test of H I = ⋂ k∈ I {θ k ⩽ δ k }, then it is natural to choose A I (θ*) = E θ* (ϕ I,θ* | X) where ϕ I,θ* is the decision function for a Dunnett test of which can be derived via a corresponding translation of the test statistics.
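As a simplified illustration of a conditional error function and its augmented family, consider a single hypothesis tested by a preplanned fixed-sample one-sided z-test with known variance, rather than the Dunnett test discussed above. The function names, the known standard deviation sigma and the stage sample sizes n1 and n2 are assumptions of this sketch. The conditional error is the probability, under the null boundary, that the preplanned test rejects given the first-stage statistic.

```python
from math import sqrt
from statistics import NormalDist

_PHI = NormalDist()


def conditional_error(z1, n1, n2, alpha):
    """Conditional rejection probability of a fixed-sample level-alpha z-test.

    The preplanned statistic is (sqrt(n1)*Z1 + sqrt(n2)*Z2) / sqrt(n1 + n2),
    rejecting when it is at least Phi^{-1}(1 - alpha); Z2 is standard normal
    at the null boundary and independent of Z1.
    """
    crit = _PHI.inv_cdf(1.0 - alpha)
    threshold = (crit * sqrt(n1 + n2) - z1 * sqrt(n1)) / sqrt(n2)
    return _PHI.cdf(-threshold)


def shifted_conditional_error(est1, sigma, n1, n2, theta_star, alpha):
    """Conditional error for the shifted null {theta <= theta_star}.

    The first-stage z-statistic is translated by theta_star, so the result
    is nonincreasing in theta_star.
    """
    z1_shifted = (est1 - theta_star) * sqrt(n1) / sigma
    return conditional_error(z1_shifted, n1, n2, alpha)
```

Averaging the conditional error over the null distribution of the first-stage statistic recovers α, which is the defining constraint on a conditional error function; the shifted version coincides with the unshifted one at the original null boundary.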
Using the arguments of Propositions 1 and 2, it can be shown that, analogously to (8), a compatible 100(1 − α)% confidence region for θ is where and A ∅ (θ*) are set equal to and A T 1 (θ*) respectively. Also, the largest compatible 100(1 − α)% confidence lower bounds are l k =min I⊆T 1 l k,I , where for k ∈ I, and for k ∉ I, with A k,I (ϑ) defined analogously to in Definition 1.

Concluding remarks
The lower confidence bounds (12)-(13) provide more information about the location of θ than the decisions of the closed test procedure of § 2·3. The utility of this additional information will depend strongly on the context. In practice, the primary concern will often be to find lower bounds for the components of θ corresponding to the rejected null hypotheses. As this can be achieved using an algorithm that is $O(K^2)$, application to large-scale simultaneous inference problems is, in principle, feasible. However, these lower bounds will only be informative if all hypotheses considered in the second stage of testing are rejected, and even this may be insufficient. In practice, therefore, the lower bounds (12)-(13) are only likely to be useful in relatively small-scale problems. Furthermore, in situations where informative lower confidence bounds are deemed to be more important than the possibility of rejecting as many individual null hypotheses as possible, it would be sensible to use the intervals (14) instead of applying the closed test procedure. For large-scale simultaneous inference problems, an approach based on controlling the false coverage-statement rate (Benjamini & Yekutieli, 2005) may be more appropriate than aiming for a high simultaneous coverage probability.
Extensions to more than two stages and to allow early rejection of hypotheses are straightforward with an appropriate combination function in place of (3). An open question is how best to choose φ ∅ (θ*) and .
The tests we use in region (8) are a natural choice but may not be the most powerful.

Appendix
that . The same inequality follows from and (13) if k ∉ I. Therefore, θ* ∉ C 1 and C 1 ⊆ × k∈T 1 (l k , ∞).
To show that no smaller interval (l k + ϵ, ∞) is possible for any ϵ > 0, we must find some θ* ∈ C 1 with . Consider a subset I ⊆ T 1 such that l k = l k,I and therefore for all ϑ > l k . If k ∈ I or, equivalently, l k < δ k , take any . If k ∉ I or, equivalently, l k ⩾ δ k , take any . Now consider a parameter vector , where , for k ≠ i ∈ I, and for i ∉ I ∪ {k}. All such parameter vectors ξ I,k are contained in Θ I , and Thus there exists some such ξ I,k ∈ C 1 and hence C 1 is not contained in this smaller product of intervals.
Finally, H k ∩ × k∈T 1 (l k , ∞) = ∅ if and only if l k,I ⩾ δ k for I ⊆ T 1 , if and only if , for I ⊆ T 1 and k ∈ I, if and only if ψ k = 1.

Fig. 1. Cross-sections of confidence regions of the form (9) for making inference on the second-stage parameter of interest, θ B , in the example of § 3·2: (a) two cross-sections of the 97·5% confidence region; (b) two cross-sections of the 95% confidence region.