Sequential stopping for high-throughput experiments

In high-throughput experiments, the sample size is typically chosen informally. Most formal sample-size calculations depend critically on prior knowledge. We propose a sequential strategy that, by updating knowledge as new data become available, depends less critically on prior assumptions. Experiments are stopped or continued based on the potential benefit of obtaining additional data. The underlying decision-theoretic framework guarantees that the design proceeds coherently. We propose intuitively appealing, easy-to-implement utility functions. As in most sequential design problems, an exact solution is computationally prohibitive. We propose a simulation-based approximation that uses decision boundaries. We apply the method to RNA-seq, microarray, and reverse-phase protein array (RPPA) studies and show its potential advantages. The approach has been added to the Bioconductor package gaga.


INTRODUCTION
In high-throughput studies (HTSs), the sample size is usually chosen informally. The resulting experiment may be either insufficiently informative or unnecessarily extensive. To address this problem, we develop a sequential design framework for HTSs. That is, we investigate whether the data currently available in a typical HTS suffice and, if not, how to determine an optimal stopping strategy. We focus on experiments that compare groups, although our ideas remain useful for other inferential goals. For simplicity, we discuss the two-group case, but our software allows for more than two groups. The proposal is based on Bayesian decision theory, so that decisions are coherent with respect to an underlying utility function and probability model. We emphasize ease of interpretation and use.

Probability model
Our proposal requires extensive predictive simulation and model fitting. Hence, the model must be computationally efficient. For instance, the examples in Section 5 required posterior inference in millions of simulated datasets. On the other hand, the model needs to be sufficiently flexible to capture the important features of the data.
Here we use the GaGa model (Rossell, 2009) and the log-normal normal model with generalized variance (NN; Yuan and Kendziorski, 2006) to illustrate the approach. Both offer a reasonable compromise between flexibility and computational cost. The GaGa model assumes x_ij ~ Ga(α_i, α_i/λ_{i,z_j}), where z_j denotes the group of sample j. The NN model uses x_ij ~ N(μ_{i,z_j}, σ_i^2). The triple θ_i = (λ_{i0}, λ_{i1}, α_i) (GaGa) or θ_i = (μ_{i0}, μ_{i1}, σ_i^2) (NN) incorporates gene-specific variability and gene-by-group specific means. A hierarchical prior on θ_i assigns positive prior probability to the means being equal across groups. The GaGa hierarchical prior is λ_{ik}^{-1} ~ Ga(α_0, α_0/ν) and α_i ~ Ga(β, β/μ), with λ_{i0} = λ_{i1} when δ_i = 0, and P(δ_i = 1) = π, independently across i. The hyper-parameters are ω = (α_0, ν, β, μ, π). The NN hierarchical prior is μ_{ik} | σ_i^2 ~ N(μ_0, τ_0^2 σ_i^2) and σ_i^2 ~ IG(ν_0/2, ν_0 σ_0^2/2), with μ_{i0} = μ_{i1} when δ_i = 0, also with probability of ties P(δ_i = 1) = π, independently across i. For this model, ω = (μ_0, τ_0, ν_0, σ_0, π). The GaGa sampling distribution for x_ij captures asymmetries that are frequently observed in HTS data. The NN assumptions are similar to those of the popular limma approach (Smyth, 2004), which has proved useful in many applications. The supplementary material available at Biostatistics online (Section 3) shows goodness-of-fit assessments that help choose the most appropriate model for a particular dataset.
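To make the hierarchical structure concrete, the following R sketch simulates a toy two-group dataset from each model, assuming the prior forms sketched above; the hyper-parameter values and function names (simGaGaToy, simNNToy) are arbitrary illustrations rather than estimates from any dataset analyzed here.

## Toy simulation from the GaGa and NN hierarchical models (illustrative values only).
set.seed(1)
n <- 1000; z <- rep(0:1, each = 5)                       # genes and group labels for 10 samples
simGaGaToy <- function(n, z, a0 = 10, nu = 1, beta = 2, mu = 1, pi = 0.1) {
  delta <- rbinom(n, 1, pi)                              # delta_i = 1: differentially expressed
  alpha <- rgamma(n, beta, beta / mu)                    # gene-specific shape alpha_i
  lam0  <- 1 / rgamma(n, a0, a0 / nu)                    # group 0 mean: lambda^{-1} ~ Ga(a0, a0/nu)
  lam1  <- ifelse(delta == 1, 1 / rgamma(n, a0, a0 / nu), lam0)
  lam   <- cbind(lam0, lam1)
  x <- sapply(z, function(g) rgamma(n, alpha, alpha / lam[, g + 1]))
  list(x = x, delta = delta)
}
simNNToy <- function(n, z, mu0 = 5, tau0 = 1, nu0 = 4, s0 = 0.5, pi = 0.1) {
  delta  <- rbinom(n, 1, pi)
  sigma2 <- nu0 * s0^2 / rchisq(n, nu0)                  # scaled inverse chi-square variances
  m0     <- rnorm(n, mu0, tau0 * sqrt(sigma2))           # group means, tied when delta_i = 0
  m1     <- ifelse(delta == 1, rnorm(n, mu0, tau0 * sqrt(sigma2)), m0)
  m <- cbind(m0, m1)
  x <- sapply(z, function(g) rnorm(n, m[, g + 1], sqrt(sigma2)))
  list(x = x, delta = delta)
}
gg <- simGaGaToy(n, z); nn <- simNNToy(n, z)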
In terms of computational complexity, conditional on ω the posterior distributions are available in closed form. We treat ω as fixed, avoiding the need for Markov chain Monte Carlo (MCMC) simulation. This substantially increases the computational speed. We estimate ω via expectation-maximization as in Rossell (2009) and Yuan and Kendziorski (2006). The latter proposed a method of moments estimate for (ν_0, σ_0^2), which can result in overestimating π. We illustrate this issue and outline a simple procedure to adjust π in the supplementary material available at Biostatistics online (Section 4).
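To illustrate how the posterior probabilities P(δ_i = 1 | x_i) arise from ω, the sketch below computes them as π m_1(x_i) / {π m_1(x_i) + (1 − π) m_0(x_i)}, where m_0 and m_1 are the marginal likelihoods under tied and untied means. For simplicity the marginals are approximated here by Monte Carlo averaging over the NN prior; in practice the GaGa and NN models admit much faster closed-form expressions. The function name and prior values are illustrative assumptions.

## Monte Carlo illustration of P(delta_i = 1 | x_i) under the NN prior (closed forms are used in practice).
postProbDEToy <- function(xi, z, mu0 = 5, tau0 = 1, nu0 = 4, s0 = 0.5, pi = 0.1, B = 5000) {
  sigma2 <- nu0 * s0^2 / rchisq(B, nu0)                  # draws from the variance prior
  muA    <- rnorm(B, mu0, tau0 * sqrt(sigma2))           # shared mean (used when delta_i = 0)
  muB    <- rnorm(B, mu0, tau0 * sqrt(sigma2))           # second group mean (when delta_i = 1)
  lik <- function(g0, g1) sapply(seq_len(B), function(b)
    prod(dnorm(xi, ifelse(z == 0, g0[b], g1[b]), sqrt(sigma2[b]))))
  m0 <- mean(lik(muA, muA))                              # marginal likelihood under tied means
  m1 <- mean(lik(muA, muB))                              # marginal likelihood under untied means
  pi * m1 / (pi * m1 + (1 - pi) * m0)
}
z  <- rep(0:1, each = 5)
xi <- c(rnorm(5, 5, 0.7), rnorm(5, 6, 0.7))              # one toy gene that is truly DE
postProbDEToy(xi, z)                                     # posterior probability of DE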
While we use these two models in our examples, the upcoming discussion of the optimal stopping policy remains valid for any alternative probability model.

Pre-processing
We assume that the data are suitably pre-processed. This is critical for meaningful inference. For instance, ignoring batch effects may bias or add uncertainty to group comparisons. We note that some technologies such as RNA-seq may be less sensitive to batch effects, and that these can be partially mitigated by good design, e.g. by balancing the number of samples in each group and batch. We recommend jointly preprocessing data after every batch, as some technical biases (e.g. probe or GC-content biases) may be better assessed once more data are collected.
Batch effects and other sources of variability may be addressed either in the pre-processing or in the analysis, by including appropriate terms in the model. Following Yang and Speed (2002) and Durrieu and Briollais (2009), we argue in favor of the former. As an illustration, let y_ij be a vector of covariates used in the adjustment, and assume that E(x_ij | μ_{i,z_j}, y_ij) = μ_{i,z_j} + g(y_ij). Here, g(·) captures the effect of y_ij on the outcome and could represent a non-linear adjustment that cannot be captured by the analysis model. One could then use the partial residuals x̃_ij = x_ij − ĝ(y_ij) as the pre-processed data, where ĝ(·) is an appropriate estimate of g(·).
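As a minimal illustration of this adjustment with a single covariate, the sketch below estimates g(·) non-parametrically with loess after removing group means; the covariate and its non-linear effect are invented for illustration.

## Covariate adjustment by partial residuals for one gene (toy data with a smooth GC-type bias).
set.seed(2)
m  <- 40                                                  # samples for one gene
gc <- runif(m, 0.3, 0.7)                                  # covariate y_j, e.g. GC content
z  <- rep(0:1, each = m / 2)                              # group labels
x  <- 5 + 0.5 * z + sin(8 * gc) + rnorm(m, 0, 0.3)        # expression with non-linear bias g(.)
r    <- resid(lm(x ~ factor(z)))                          # remove group means before estimating g
ghat <- loess(r ~ gc)                                     # non-parametric estimate of g(.)
xadj <- x - predict(ghat)                                 # partial residuals: pre-processed data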
We note that the domain of the data must match the assumptions of the model. For example, while most technologies deliver positive expression measurements, pre-processed data may sometimes present negative values that are not allowed in the GaGa model. A simple strategy to deal with negative values is to define x̃_ij = x_ij + k, where the offset k > 0 ensures that x̃_ij > 0. Alternatively, define x̃_ij = e^{x_ij}, but this option may produce outliers that decrease the model goodness-of-fit. In practice, we recommend trying several transformations and producing some goodness-of-fit plots (e.g. see supplementary material available at Biostatistics online, Section 3).
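A small sketch of both transformations, using an arbitrary offset and stand-in pre-processed values:

## Handling negative pre-processed values before fitting the GaGa model.
xadj   <- rnorm(40, mean = 0.2, sd = 1)        # stand-in pre-processed data, some values < 0
k      <- abs(min(xadj)) + 0.1                 # arbitrary offset guaranteeing positivity
xshift <- xadj + k                             # option 1: shift by the offset k
xexp   <- exp(xadj)                            # option 2: exponentiate (may create outliers)
par(mfrow = c(1, 2)); hist(xshift, main = "shifted"); hist(xexp, main = "exponentiated")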

Decision criterion
We formalize sequential sample-size calculation within a Bayesian decision-theoretic framework. The optimal design is chosen by maximizing the expectation of an appropriate utility function. At each decision time, the expected utility is conditional on all available data (which may be no data at all) and averaged with respect to uncertainty on the model parameters and future data, assuming optimal future decisions.
It is convenient to distinguish sequential and terminal decisions. Sequential decisions correspond to stopping vs. continuation and are made after each batch of observations. Terminal decisions are the classification of genes into equally expressed (EE, δ_i = 0) or differentially expressed (DE, δ_i = 1), and are taken only upon stopping. Let s_t = s(x_{1:t}) = 1 indicate the sequential decision of stopping at time t and s_t = 0 indicate continuation. Equivalently, s_t can be described by the stopping time τ = min{t : s_t = 1}. We use s_t and τ interchangeably. Let d_i(x_{1:t}) = 1 (0) indicate the terminal decision to report gene i as DE (EE). Also, let d(x_{1:t}) = (d_1(x_{1:t}), . . . , d_n(x_{1:t})). Both s_t and d(x_{1:t}) depend on all data available up to time t.
In a fully decision-theoretic approach, sequential and terminal decisions are chosen to jointly maximize the expected utility. Instead, we assume a fixed rule for d(x_{1:t}) and focus on the optimal choice of s_t only. We take terminal decisions using the Bayes rule of Müller and others (2004) to control the posterior expected false discovery rate (FDR) below some specified level. The posterior expected FDR is FDR(x_{1:t}) = Σ_i d_i(x_{1:t}) P(δ_i = 0 | x_{1:t}) / D, where D = Σ_i d_i(x_{1:t}) is the number of reported positives. We use the 0.05 level throughout.
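For concreteness, the sketch below implements a terminal rule of this type: genes are flagged in decreasing order of v_i = P(δ_i = 1 | x_{1:t}) for as long as the posterior expected FDR of the flagged set stays below the target level. The function name and toy probabilities are illustrative.

## Terminal decision d(x_{1:t}): flag the largest set of genes whose posterior expected FDR <= fdrmax.
flagDE <- function(v, fdrmax = 0.05) {
  o    <- order(v, decreasing = TRUE)
  efdr <- cumsum(1 - v[o]) / seq_along(v)      # posterior expected FDR when flagging the top j genes
  nsel <- max(0, which(efdr <= fdrmax))        # largest flagged set meeting the FDR target
  d    <- integer(length(v))
  if (nsel > 0) d[o[seq_len(nsel)]] <- 1L
  d
}
v <- c(0.99, 0.97, 0.90, 0.60, 0.30, 0.05)     # made-up posterior probabilities of DE
flagDE(v)                                      # flags the first three genes in this example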
Sequential stopping decisions s_t are based on a utility function with sampling cost c and a unit reward for each correctly identified DE outcome, u = −c τ + Σ_{i=1}^{n} δ_i d_i(x_{1:τ}). (3.1) The second term in (3.1) is the number of true positives (TPs). The cost c is the minimum number of TPs that make it worthwhile to obtain one more sample. This interpretation allows for easy elicitation of c, without any reference to the formal mathematical framework. The utility function (3.1) focuses on statistical rather than biological significance, as the size of the effect is not considered. A simple alternative is obtained by substituting |μ_{i0} − μ_{i1}| δ_i d_i(x_{1:τ}) in the summation in (3.1); see Müller and others (2004) or Rice and others (2008) for other interesting alternatives. The upcoming discussion is independent of the specified utility.
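Under (3.1), both the realized utility (given a simulated truth δ) and its posterior expectation are straightforward to evaluate; a minimal sketch, reusing flagDE from the previous example and an arbitrary per-batch cost:

## Utility (3.1): u = -c * tau + number of true positives among the flagged genes.
utilityRealized <- function(d, delta, tau, cost) -cost * tau + sum(d * delta)   # needs the true delta
utilityExpected <- function(d, v, t, cost)       -cost * t   + sum(d * v)       # posterior expectation
v <- c(0.99, 0.97, 0.90, 0.60, 0.30, 0.05)
d <- flagDE(v)                                  # terminal decision from the previous sketch
utilityExpected(d, v, t = 3, cost = 0.5)        # expected utility if we stop after t = 3 batches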

Optimal rule
The optimal stopping decision s_t maximizes u(·) in expectation over all unknowns, including the parameters (θ, δ) and the future data x_{τ+1:T}. An exact solution requires dynamic programming, also known as backward induction (DeGroot, 2004). At time t, the optimal decision is to stop if the posterior expected utility of stopping, ū_t(s_t = 1, x_{1:t}) = −c t + Σ_{i=1}^{n} d_i(x_{1:t}) P(δ_i = 1 | x_{1:t}), is at least as large as the expected utility of continuing, ū_t(s_t = 0, x_{1:t}), where P(δ_i = 1 | x_{1:t}) is the posterior probability that outcome i is DE. The expectation is with respect to δ only, as we fix the terminal decision d(x_{1:t}). The posterior probability P(δ_i = 1 | x_{1:t}) can be computed in closed form for some models (including GaGa and NN) or can easily be estimated from MCMC output. Evaluating ū_t(s_t = 0, x_{1:t}) is more challenging. An exact solution requires assessing the expected utility for all possible future data trajectories x_{t+1}, . . . , x_T, substituting the optimal decisions s_{t+1}, . . . , s_T. The computational cost is prohibitive.

APPROXIMATION BY OPTIMAL DECISION BOUNDARIES
Berry and others (2001), Brockwell and Kadane (2003), DeGroot (2004), and Müller and others (2006) discuss alternatives to an exact optimal sequential solution. Following Rossell and others (2006), we define sequential stopping boundaries. We restrict the maximization to rules that depend on the data x_{1:t} only through a summary statistic S_t and linear boundaries that partition the sample space. We propose using S_t = Δ_t U = E_{x_{t+1}}(ū_{t+1}(s_{t+1} = 1, x_{1:t+1}) | x_{1:t}) − ū_t(s_t = 1, x_{1:t}), the one-step-ahead increase in the expected utility, where E_{x_{t+1}}(· | x_{1:t}) conditions on x_{1:t} and marginalizes with respect to the future data x_{t+1}. For (3.1), we find Δ_t U = Δ_t(TP) − c, where Δ_t(TP) is the expected increase in the number of TPs, and decision boundaries can equivalently be written in terms of Δ_t(TP). Consider the example in Figure 1. The thick black line is a decision boundary. Every time we observe new data, we compute Δ_t(TP). If Δ_t(TP) lies above the boundary, we continue experimentation; otherwise we stop. That is, we experiment as long as enough new TPs are expected.
Let b = (b_0, b_1) be the intercept and slope defining the linear boundaries, and U(b, x_{1:t}) the associated expected utility given data up to time t. In other words, U(b, x_{1:t}) is the expected utility conditional on x_{1:t} when the stopping decision is based on the decision boundary indexed by b. Algorithm 1 details a forward simulation algorithm (Carlin and others, 1998) to evaluate the required expectations and a grid search to carry out the maximization of U(b, x_{1:t}) with respect to b. The algorithm assumes that t samples are already available; when no data are available, use t = 0 and x_{1:t} = ∅ (see Section 5.3).
(1) Forward simulation. For j = 1, . . . , B, simulate a trajectory of future data x^{(j)}_{t+1:T} from the posterior predictive distribution p(x_{t+1:T} | x_{1:t}) and save the corresponding summaries Δ_t(TP)^{(j)}.
(2) Grid search. For each b, find the stopping times τ^{(j)} for all saved trajectories Δ_t(TP)^{(j)}, estimate U(b, x_{1:t}) by averaging the expected terminal utilities at x^{(j)}_{1:τ^{(j)}} across trajectories, and report the b maximizing the estimate.
Figure 1 shows simulated Δ_t(TP) as gray lines. For each boundary b, we determine the stopping time for each trajectory and average the expected terminal utilities. At t = T − 1, we do not determine stopping using b but the optimal rule Δ_{T−1} U > 0.
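The following self-contained sketch mimics the two steps of Algorithm 1 on toy Δ_t(TP) trajectories: it scans a grid of boundaries b = (b_0, b_1), finds the stopping time of each trajectory, averages the resulting utilities, and reports the best boundary on the grid. The trajectories here are drawn from an arbitrary decaying pattern purely for illustration; in practice they come from posterior predictive simulation of future batches.

## Toy version of Algorithm 1: grid search over linear boundaries b0 + b1 * t applied to
## simulated Delta_t(TP) trajectories; stop at the first t where the trajectory hits the boundary.
set.seed(3)
B <- 200; Tmax <- 10; cost <- 50
dtp <- t(sapply(1:B, function(j) 150 * exp(-0.5 * (1:Tmax)) * runif(1, 0.5, 1.5)))

boundaryUtility <- function(b0, b1, dtp, cost) {
  Tmax <- ncol(dtp)
  mean(apply(dtp, 1, function(traj) {
    bound <- b0 + b1 * seq_len(Tmax)
    tau   <- which(traj <= bound)[1]            # first time the trajectory hits the boundary
    if (is.na(tau)) tau <- Tmax                 # never hits it: stop at the horizon
    sum(traj[seq_len(tau)]) - cost * tau        # expected TPs gained since t = 0, minus sampling cost
  }))
}

grid   <- expand.grid(b0 = seq(0, 200, by = 10), b1 = seq(-20, 0, by = 2))
grid$U <- mapply(boundaryUtility, grid$b0, grid$b1, MoreArgs = list(dtp = dtp, cost = cost))
grid[which.max(grid$U), ]                       # approximately optimal boundary on the grid

In this sketch the boundary is applied at every interim time, whereas Algorithm 1 switches to the myopic rule Δ_{T−1} U > 0 at the last interim analysis; in practice the trajectories would come from the posterior predictive, as described above.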
In principle, b can be recomputed every time new data are observed. Recomputation can help to decide between multiple optima and to update P(x_{t+1:T} | x_{1:t}). In our examples, we determine b only once, either based on a pilot dataset or on prior knowledge, but we indicate the usefulness of recomputation when appropriate.
In addition to its intuitive appeal, some theoretical considerations motivate our approach. First, fixed-sample designs are special cases, e.g. b = (4.5, ∞) results in a fixed sample size of 5. The myopic rule of continuing as long as Δ_t U > 0 (Berry and Fristedt, 1985, Chapter 7) is the special case b = (c, 0). We generalize the idea with an arbitrary boundary on Δ_t U. An important assurance is that Δ_t(TP) converges to 0 as t → ∞, which guarantees eventual stopping; see the supplementary material available at Biostatistics online (Section 1) for a formal statement and proof.

EXAMPLES
We compare our approach with the fixed sample designs of Müller and others (2004) under several important experimental conditions. The supplementary material available at Biostatistics online discusses pre-processing and goodness-of-fit (Section 3) and an additional RNA-seq example with R code (Section 4).

Simulated microarray study
We plan to collect data in batches of 2 arrays per group, with a maximum of 20 per group (i.e. T = 10 batches). Recall that c is the minimum number of new DE genes that compensates for the cost of one more batch. We consider c = 25, 50, and 100. To keep the simulation realistic, we estimated the hyper-parameters from a leukemia microarray study (Armstrong and others, 2002), focusing on 24 acute lymphoblastic leukemia (ALL) and 18 mixed-lineage leukemia (MLL) translocation samples.
Table 1. Simulated data. t*_F: fixed sample size; t*_S: average sequential sample size; U*_S − U*_F: expected utility for the sequential design minus the expected utility for a fixed-sample design.
We find optimal strategies based on π = π̂, but we assess performance under model mis-specification by also simulating data under π = 0.5π̂ and π = 2π̂, while leaving π = π̂ unchanged in the analysis model. We obtained 250 simulations under each scenario. Figure 1 shows the optimal boundary for c = 50 and simulated Δ_t(TP) (gray lines) under the GaGa model. Table 1 reports expected utilities and stopping times. The optimal fixed sample sizes for c = 50 under the GaGa and NN models are t*_F = 5 and t*_F = 4, respectively. When π = π̂, the expected sequential sample sizes are 5.0 and 4.1 (respectively) and there is no gain in the posterior expected utility. Sequential designs offer little advantage when the data match the prior expectations. However, when prior expectations are unrealistic, sequential designs adapt to the observed data. When the prior overstated π (data simulated under π = 0.5π̂), sequential designs stopped earlier than the fixed sample-size designs. Conversely, when π = 2π̂, they stopped later so that more DE genes could be found. For instance, for c = 50 the GaGa sequential design requires on average 4.1 data batches when π = 0.5π̂ and 6.9 when π = 2π̂. The fixed design always requires 5.

High-throughput sequencing example
We use a pilot RNA-seq dataset with two human muscle samples and one human brain sample to design two hypothetical studies. Study 1 compares gene expression in muscle vs. brain; many DE genes are expected. Study 2 compares the two muscle samples; no genes should be DE. In both cases, we use one sample per group as pilot data. We consider up to T = 5 additional samples, in batches of one sample. The GaGa model provided a reasonable fit to these data (supplementary material available at Biostatistics online, Section 3).
We determined the optimal boundary for sampling costs c = 0, 1, . . . , 100. Figure 2(a) shows that Δ_t(TP) is maximal for t = 2 additional data batches. As suggested by Theorem 1 (supplementary material available at Biostatistics online, Section 1), the incremental reward decreases as t grows further. The dashed boundary shows that, for c > 66, the optimal decision is to stop experimentation. For c ≤ 66, there are multiple optimal b*. The solid black lines show two optima. In both cases, the decision at t = 0 is to continue. Since the simulated trajectories do not cross either boundary, we expect experimentation to continue up to T = 5. The future realized Δ_t(TP) might cross the boundary, in which case the design would adapt and stop experimentation before T = 5. Given that the pilot data contain only one sample per group, we would re-determine b upon observing new data. The hypothetical muscle vs. muscle comparison is shown in Figure 2(b). In this case, Δ_t(TP) is negligible and the optimal design is to stop at t = 0 (i.e. not to collect any further data) for any c > 2. This result seems sensible, as no DE genes are expected.

Microarray example
We consider the leukemia study of Campo Dell'Orto and others (2007), recording mRNA expression for 21 ALL and 15 MLL patients and 54 675 genes. We consider designing the study before any data were available. In such circumstances, one could estimate the hyper-parameters ω from a similar study. We used the Armstrong and others (2002) study (Section 5.1), as it was also carried out on ALL/MLL patients and used the same microarray platform. Once the fixed and sequential designs were determined, we used the historical data to compare their performance. We use batches of two arrays per group, a maximum of T = 7 batches, and c = 50.
The white bars in Figure 3 (left panels) show the expected utility under the GaGa and NN priors for all fixed sample sizes. The optimal fixed sample sizes are t*_F = 5 batches (GaGa) and t*_F = 4 (NN). The right panels show the optimal boundaries and Δ_t(TP) computed from the observed data up to time t = 1, . . . , 7. For both models, the sequential design continues up to the time horizon. Figure 3 compares the designs by computing the posterior expected TPs (gray bars) and the number of genes with limma P-values < 0.05 after the Benjamini and Yekutieli (2001) adjustment (black bars). At the time horizon, both quantities increase more than 2-fold compared with the recommended fixed sample size. The differences between prior and posterior expected TPs show how sequential designs adapt to the observed data to correct prior mis-specifications.

Reverse-phase protein arrays
We design a follow-up study for the RPPA dataset dataIII included in the R package RPPanalyzer (Mannsperger and others, 2010). The data contain expression measurements for 75 proteins in 35 stage 2 and 25 stage 3 samples. Both the NN and GaGa models provide a reasonable fit (supplementary material available at Biostatistics online, Section 3), the fit under the NN model being slightly better. We find π̂ = 0.13 under the NN model and π̂ = 0.10 under the GaGa model; that is, the estimated number of DE proteins is 9.75 and 7.5, respectively. Although we expect several DE proteins, at a posterior expected FDR < 0.05 the NN model calls one DE protein and the GaGa model makes no DE calls. For comparison, only one protein has a limma BY-adjusted P-value below 0.05. We consider adding batches of 50 samples per group, up to a maximum of T = 4 batches. We set the sampling cost to c = 1, reflecting that RPPA samples are relatively cheap. The study focuses on 75 carefully chosen proteins. Figure 4 shows simulated Δ_t(TP) trajectories and the optimal boundaries. Inference under the GaGa model estimates fewer TPs; otherwise, inference is fairly similar across models and the optimal boundaries are remarkably close. The average sample size is t*_S = 1.37 under the NN model and t*_S = 1.32 under the GaGa model. The expected number of TPs at t*_S is 9.46 (NN) and 6.11 (GaGa); that is, according to both models, most DE proteins should be detected by adding 1-2 batches, i.e. 50-100 samples per group. These results help assess the potential benefits of extending the experiment.

DISCUSSION
We proposed a sequential strategy for massive multiple hypothesis testing. An important advantage lies in the generality of the proposed design. We discussed three RNA-seq, one microarray, and one RPPA experiments. Sequential designs are robust with respect to inaccurate prior guesses and provide substantial advantages over fixed sample designs.
The proposal is formulated in a decision-theoretic framework and emphasizes interpretability. We monitor the one-step-ahead expected increment in utility and stop the experiment when it falls below a boundary. The approach includes fixed sample size and myopic designs as special cases. We use terminal decisions that control the posterior expected FDR. While this is inconsistent with a strict decision-theoretic setup, where all decisions are taken to maximize the expectation of a single utility, we feel that our choice offers a pragmatic compromise. The method allows stopping when only one or two samples are available, which requires making strong parametric assumptions. For instance, in Figure 3, the posterior expected TPs and Δ_t(TP) based on two samples per group differ widely between the GaGa and NN models. Nevertheless, both models correctly indicate to continue and show good agreement for subsequent samples. Whenever possible, we recommend using a minimum burn-in (e.g. 3 samples) before starting sequential stopping. When this is not feasible, we recommend assessing the goodness-of-fit carefully and updating the forward simulation as more data become available.
We focused on group comparison experiments, but the framework can serve as the basis for other HTSs. Interesting extensions include classification, clustering, or network discovery studies. These would require adjusting the utility function and possibly the probability model, e.g. to capture strong dependencies between outcomes.
Sequential designs are most appealing in moderate to large studies, where technical limitations require gathering data in batches. They should also prove valuable when samples are costly to obtain or there are ethical considerations, e.g. in human studies. Overall, they help save valuable resources and guarantee that sufficient data are collected to answer the scientific questions.

SOFTWARE
An implementation of the proposed approach was added to the Bioconductor package gaga.