Interim recruitment prediction for multi-center clinical trials

Summary We introduce a general framework for monitoring, modeling, and predicting the recruitment to multi-center clinical trials. The work is motivated by overly optimistic and narrow prediction intervals produced by existing time-homogeneous recruitment models for multi-center recruitment. We first present two tests for detection of decay in recruitment rates, together with a power study. We then introduce a model based on the inhomogeneous Poisson process with monotonically decaying intensity, motivated by recruitment trends observed in oncology trials. The general form of the model permits adaptation to any parametric curve-shape. A general method for constructing sensible parameter priors is provided and Bayesian model averaging is used for making predictions which account for the uncertainty in both the parameters and the model. The validity of the method and its robustness to misspecification are tested using simulated datasets. The new methodology is then applied to oncology trial data, where we make interim accrual predictions, comparing them to those obtained by existing methods, and indicate where unexpected changes in the accrual pattern occur.


INTRODUCTION
Efficiently recruiting patients to clinical trials is a critical factor in running clinical trials and hence delivering new medicines to patients as quickly as possible. Late-stage clinical trials are commonly run across many sites, and successfully managing and running trials and subsequent processes require accurate forecasts of trial recruitment.
Early recruitment rates can be high, for example, because patients with the required condition are already available, and rates can then drop once these patients have been recruited. Deterministic approaches and ad hoc techniques may yield simplified and, often, overly optimistic recruitment timelines, a phenomenon thus dubbed Lasagna's Law (Lasagna, 1979). For example, 48% of centers studied by Getz and Lamberti (2013) failed to enroll the required number of patients in the time originally allocated, leading to extensions of the recruitment timelines and the need to bring more centers into the study, which itself is a costly process. The timelines are usually pushed to nearly twice the originally proposed plan. The most frequent reason for trial discontinuation appears to be poor recruitment; out of 253 discontinued trials studied in Kasenda and others (2014), 101 were terminated due to under-recruitment.
This motivates the need for robust statistical methods for modeling and predicting the recruitment to clinical trials at site level. Early detections of possible center underperformance may allow practitioners to swiftly intervene in the operations. It can also provide realistic timelines for the completion of different stages of the trials.
In this work, we introduce a novel flexible framework for effectively modeling and predicting patient recruitment. We will focus on the oncology therapeutic area as it is known for sparse enrollments whose patterns are not sufficiently captured by the state-of-the-art methods (Anisimov and Fedorov, 2007;Lan and others, 2018). Our framework utilizes time-varying recruitment rates whilst also permitting variation between recruitment centers. Inference is based on the set of known center initiation times to date, whilst the prediction is conditional on a set of future initiation times. Past initiation times are known, but typically, whilst there is a plan for future initiation times along with potential contingencies, the actual times are not known precisely in advance. The proposed methodology can be used with user-specified initiation schedules to facilitate the choice between different initiation-time scenarios, or it can be combined with a center-initiation model. Predictions of future recruitment incorporate parameter and model uncertainty, which is essential when data are limited.
Existing methods for predicting recruitment to clinical trials are overviewed in Section 2. Section 3 outlines methods for detecting recruitment rate decay in the multi-center recruitment setting along with result of a Monte Carlo power study. Section 4 introduces the flexible modeling framework, and Section 5 presents a general method for choosing sensible Bayesian parameter priors, along with an appropriate posterior sampling method and diagnostics. A simulation study is presented in Section 6, illustrating the fitting of the model, model validation, and forecasting recruitment using Bayesian model-averaging. In Section 7, the model is fitted to an oncology dataset, and this is followed by a discussion in Section 8.

EXISTING METHODS
The first statistical modeling framework for clinical trial recruitment was introduced in Lee (1983), where the recruitment was assumed to be a constant-rate Poisson process, leading to tractable inference based on interim data. Williford and others (1987) built on the model by considering Bayesian inference with conjugate priors. Gajewski and others (2008) and Jiang and others (2015) further explored the effects various prior densities can have on predictions. Time-inhomogeneous accrual was first considered in Piantadosi and Patterson (1987), where the aggregated accrual across all sites was modeled as an inhomogeneous Poisson process with intensity λ(t) = ζ(1 − exp{−κt}), ζ , κ > 0. Zhang and Long (2010) took a nonparametric approach, using B-splines to model the trends in accrual and using the intensity value at the census time for predictions. Tang and others (2012) proposed a Poisson model with a piece-wise linear intensity which captured aspects of recruitment such as slow initial recruitment and a spike in recruitment close to the end of the trial. For a more thorough review of these as well as other methods, see Heitjan and others (2015). Accrual-only modeling methods do not consider the effect that initiating new centers can have on recruitment trends. For that reason, we shall focus on methods which can take advantage of center-specific recruitment data. Anisimov and Fedorov (2007) introduced the Poisson-gamma (PG) model of recruitment in a multicenter setting, with the main appeal being the use of random effects for the recruitment rates of centers, providing a tractable, data-driven prior predictive distribution for recruitment in yet-unopened centers. The model consists of C centers, each recruiting N c patients over τ c days, c = 1, . . . , C. The framework makes the following distributional assumptions, The random effect λ c is the recruitment rate for center c. The rates, and thus the center recruitments, are assumed to be independent conditional on α and φ. There are, however, several caveats with the approach taken. The article advocates using the Empirical Bayes approach, that is, maximum likelihood estimation for the hierarchical parameters (α, φ) followed by re-estimation of the distribution of random effect λ c given α, φ, and n c , for each center. A method for obtaining the uncertainty in the hierarchical (α, φ) parameters is provided, but this uncertainty is not accounted for when making predictions, leading to overly confident prediction intervals. However, the main issue which could result from employing the model arises from the strong assumption of time-homogeneity of center recruitments, which can lead to underestimations of the time to completion. Figure 1 shows the accrual in a simulated trial where the rates gradually decay with time as well as the predictive distribution of the PG model fitted at a census time of three-fifths of the total length of the study; the initiation day for each center is marked. The accrual appears to follow a straight line which could initially suggest using a time-homogeneous model. However, new centers are constantly being initiated so that a constant recruitment rate for each center leads to an upward arching trend in accrual. This is encapsulated by the fitted predictive. Here, the accrual is initially badly underestimated and then grossly overestimated after the census time. The apparent "matching" at the census time is due to predictions using re-estimated random-effect distributions. Lan and others (2018) describe the first multi-center recruitment model in which the rates decrease over time. The model assumes inhomogeneous Poisson for arrivals center c with an intensity of the form where λ o c is a gamma random effect, as in (2.1), and t o a user-specified parameter and is not estimated as part of the inference. By enforcing the specific intensity-form, the possibilities of time-homogeneous recruitments or even intensity decays with heavier tails are excluded. A more systematic alternative is to start by testing the time-homogeneity assumption.

DETECTING TIME-INHOMOGENEITY
Given series of daily center recruitment counts over the recruitment period of τ c days, {N c (t)} τc t=1 , c = 1, . . . , C, we can test the hypothesis of time-homogeneity. To detect a decay in the rate, we only need to use the sums X (c) , whose expectations we denote by μ (c) 1 and μ (c) 2 , respectively. Detecting time-inhomogeneity in a single center can be difficult as the infrequent counts will lead to low powers of tests (Krishnamoorthy and Thomson, 2004) (see also Tables 1  and 2). Thus, we combine the recruitments across all centers leading to two counts: X 1 = C c=1 X (c) 1 and X 2 = C c=1 X (c) 2 , and we choose our hypotheses to be H 0 : The tests are one-sided as we are only interested in recruitment which decays over time. We consider tests with respect to the following assumptions: Assumption 1: For each center c = 1, . . . , C, the counts in the first and second halves of that center's recruitment period are independent and have the same distribution, X (c) 2 , with expectation μ (c) 1 . Furthermore, the recruitments at each center are independent of each other.
Assumption 2: The patients arrive according to a Poisson process such that X (c) for some μ (c) 1 , c = 1, . . . , C. Assumption 1 implies that X 1 and X 2 must have the same distributions, with respective expectations μ 1 = C c=1 μ (c) 1 and μ 2 = C c=1 μ (c) 2 being equal. Assumption 2 further implies that the distributions must be Poisson. Figure 2 shows the construction of the quantities X 1 and X 2 by aligning the centers of the recruiting periods. The splitting of the series halfway is arbitrary, though splitting it in half (or at least close to this) would theoretically yield the highest power. It assumes that the τ c are even. However, centers recruiting over odd numbers of days can still be used by removing the middle day observation. This reduces the power of the tests, though the reduction is negligible. Gu and others (2008) offer a detailed Monte Carlo study of the different methods used for testing for a difference in means of two Poisson variables. Here, we focus on the ones most applicable to the clinicaltrial recruitment setting, bearing in mind statistical power and robustness. We identified two methods: the non-parametric bootstrapped test (BST), which is powerful yet robust, and the Poisson likelihood-ratio test (LRT), which makes stronger distribution assumptions to achieve an even higher power. The BST only assumes that the counts in each day are independent and identically distributed (Assumption 1). With this assumption, resampling within each center with replacement, from the original data would still produce a valid sample from the assumed distribution under H 0 . A large number of bootstrap samples is used to simulate the distribution of the difference in two means, which is then used to test the hypothesis. Appendix A of the Supplementary material available at Biostatistics online details the sampling procedure for obtaining the distribution and the p-value.
For the LRT, we require Assumption 2, which is already an underlying assumption for the model in Anisimov and Fedorov (2007). Upon aggregation, the two sums follow Poisson distributions, that is, X 1 ∼ Pois(μ 1 ) and X 2 ∼ Pois(μ 2 ). The likelihood under the null model (μ 1 = μ 2 ) is compared to the likelihood under the alternative two-mean model (μ 1 > μ 2 ). Here, the likelihood function is We let whereμ is the MLE under the null, andμ 1 andμ 1 are the MLEs under the alternative hypothesis. Under the null, we would expect the test statistic T L (X 1 , X 2 ) to asymptotically be zero half the time with the other half following a χ 2 1 distribution (Robertson and others, 1988), When using the LRT, the simulated significance levels can differ from the pre-specified level when μ values are low. This is due to using the asymptotic χ 2 distribution when calculating the p-value (Gu and others, 2008).
The performance of the two tests was assessed by carrying out a Monte Carlo study. Test powers were estimated using Poisson data with different expectations and ratios, R = μ 2 /μ 1 . For the LRT power estimates, 5 × 10 6 samples were used as the test itself is very computationally cheap. For the BST, 5 × 10 4 samples were used, with each test using a bootstrapped distribution of size 10 3 . Tables 1 and 2 show the results of the study. The biggest difference in powers occurs for lower expectations, with the LRT outperforming BST. It must be noted, however, that the BST only requires the data to be i.i.d. within each center and thus is robust to violations of the Poisson assumption; if the counts within each center are overdispersed, for example, it does not affect the Type I error.
To exemplify the usefulness of this test, we can consider an interim likelihood ratio test where the expected number of enrollments is 170. This corresponds to E[X 1 ] = 100 and R = 0.7, for example, and results in a statistical power of approximately 0.75. Considering many trials require an upward of 500 enrollments, informed decisions can be made relatively early on in the trial.

PROPOSED MODEL
We consider a scenario of C centers recruiting patients, with each center c being initiated for τ c days. The number recruited by center c on day t shall be denoted by N (t) c . We propose the following modeling framework for the multi-center clinical-trial recruitment, based on the inhomogeneous Poisson process, where g is a non-negative function which dictates the curve-shape of the intensity and θ is a parameter (or parameter vector) associated with the functional form. We use the (α, φ) parametrization for the hierarchical gamma distribution as it leads to orthogonality of α and φ in the Poisson-gamma model (Huzurbazar, 1950).
Marginalizing over the random-effect component gives whence the full likelihood of the model given the recruitment data is: If all the centers had been recruiting for the same amount of time, that is, τ c ≡ τ ∀c, then by fixing the integral of g(t; θ) over τ days we could introduce orthogonality between (α, φ) and θ by imposing the normalization: τ 0 g(t; θ) dt = τ . This generalizes the homogeneous model with g(t; θ) = 1 and leads to the following factorizable likelihood, where n = C c=1 n (·) c . The factorization means that now the θ parameter describes the shape of the intensity only, and α and φ describe the distribution of the magnitude of the integrated intensity, leading to a more interpretable model. Even when centers are not all recruiting for the same length of time, we choose to impose a similar normalization using some representative τ , here 1 C C c=1 τ c . As demonstrated empirically in Section 6, the condition leads to approximate orthogonality even when the centers are initiated uniformly throughout the study.

Intensity curve-shape
In this work, we will restrict our choice of curve-shape g to parametric forms. The functional form of g is arbitrary and the best choices may depend on the context of the problem. When working with oncology datasets, for each center we observe low-frequency counts which seem to become even less frequent over time but with varying tail behaviors. For this reason, we chose the following curve-shape The proportionality is used as multiplying g κ by some positive constant and dividing φ by the same constant leads to the same model. The limit as κ → 0 recovers the standard PG model (2.1); and letting κ → ∞, we obtain an exponential tail. The full (normalized) forms are then The associated integrated forms, G κ (t; θ) are provided in Appendix B of the Supplementary material available at Biostatistics online.
The flexibility of the model, however, can result in potential identifiability issues. Inference methods, such as maximum likelihood, can run into numerical instabilities when κ >> 1 > θ or κ < 1 << θ (see Appendix B of the Supplementary material available at Biostatistics online for details). For this reason, we recommend restricting the choice of κ to a discrete set of values; in this work, we use {0, 0.5, 1, 2, ∞}. This will be elaborated on in Section 5.3.

INFERENCE, DIAGNOSTICS, AND PREDICTIONS
We aim to construct a framework which can provide reliable predictions whilst capturing uncertainty in the estimated parameters and in the underlying model itself. We employ the Bayesian paradigm since it naturally incorporates the distribution of the random effects, λ c , with the uncertainty in the model and the parameter values. However, we note that in some scenarios frequentist methods may be preferred and give a brief outline of how one may employ them in Appendix C of the Supplementary material available at Biostatistics online.
Given a parametric statistical model, the Bayesian paradigm starts from a prior distribution for the parameters, here denoted π 0 (α, φ, θ) and updates this according to some data, y, to provide a posterior distribution, here denoted by π(α, φ, θ|y). When multiple parametric models, M k , k = 1, . . . , K, are being considered, the posterior probability for model k, here denoted by π p (M k |y), may also be calculated. Section E of the supplementary material available at Biostatistics online provides more details on these quantities; see also Robert and Casella (2013) or Gelman and others (2013), for example.
For the models under consideration for trial-recruitment data, neither the posterior model probabilities nor the posteriors for the parameters for any particular model are tractable, and so we employ importance sampling to obtain Monte Carlo samples (α m , φ m , θ m ), m = 1, . . . , M from the posterior distribution for any given model, as well as an estimate of π(M k ), k = 1, . . . , K. Appendix D of the Supplementary material available at Biostatistics online provides further details of this method, as well as of effective sample size (ESS), a diagnostic which indicates the reliability of the Monte Carlo estimates; see also Robert and Casella (2013) or Doucet and others (2001).
In Sections 6 and 7, we carry out inference onα = log α,φ = log φ, andθ = log θ since analyses of trial data showed the likelihood in the log-parameters to be more symmetric about the mode, which can make sampling more efficient. For the importance sampling proposal distribution, we use a multivariate t-distribution on four degrees of freedom, with the same mode as the posterior and the shape matrix equal to the inverse Hessian at the posterior mode.

Prior choices
We base our prior specification on a maximum likelihood meta-analysis of 20 oncology clinical trial recruitment datasets. The trials studied were for seven different types of cancers: ovarian, prostate, breast, small and non-small lung, bladder, and pancreatic. The number of centers ranged from 58 to 244 with a median of 140 and total enrollments ranged from 245 to 4391 with a median of 1035. In all cases, the parameter estimators were close to orthogonal justifying the use of independent priors: π 0 (α,φ,θ) = π 0 (α)π 0 (φ)π 0 (θ).
We found that the α parameter does not change much from one study to another. The weakly informative priorα ∼ N (0.2, 2 2 ) sufficiently reflects the distribution of the estimated values.
The φ parameter estimates varied by orders of magnitude between studies. The parameter reflects the mean center recruitment and is well identified by the data; it depends upon the catchment region, type of indication and protocol, for example. For this reason, we advocate using a vague prior unless reliable expert knowledge is available. In our analyses, we used the uninformative, proper priorφ ∼ U (−8, 8).
The difference between the homogeneous (4.5) and the inhomogeneous (4.6, 4.7, 4.8) models is the curve-shape parameter θ. Lindley's paradox (Lindley, 1957) warns that assigning θ a vague prior can lower the posterior probabilities of the models that use θ , compared to the model with κ = 0 which does not use θ . To avoid the paradox, we set an informative but sensible prior by considering the drop off in intensity after some time, t 0 . We let R κ = g κ (t 0 ; θ)/g κ (0; θ) and set R κ ∼ Beta(a, b) a priori, with a = b = 1.1 to indicate a lack of information, excepting that this is not a constant intensity model, since this is covered by κ = 0, and that we do not expect a 100% drop off after a time of t 0 (expert opinion); here we take t 0 = 4 months. As R κ is a monotonic function of θ , we can use a density transform to derive the corresponding prior for θ . If prior information is abundant, be it in the form of historical data or expert knowledge, the beta distribution parameters can be adjusted to reflect this. Given (4.4), the resulting prior density forθ is given in Appendix E of the Supplementary material available at Biostatistics online.

Predictive distribution
There are two complementary properties for which predictions might be required: the distribution of future recruitments within a set time interval, and the distribution of time until the target number of recruitments is reached. In this section, we focus on the former; details of the latter appear in Appendix F of the Supplementary material available at Biostatistics online.
Suppose we are interested in sampling the recruitment, denoted N + c , at some day t + by center c. Given samples from the parameter posteriors, we can sample exactly from the posterior predictive for N + c by exploiting the Poisson-gamma conjugacy of the random-effect distribution. The posterior distribution for the λ o c random effect for center c is where α * c = α + n (·) and φ * c = φ × α+n (·) c α+φG(τc;θ) . The predictive distribution for N + c conditional on the random effect is: where G + θ = t + t + −1 g(s; θ) ds.
Marginalizing over the random effect posterior, we arrive at the negative binomial distribution: The length of interval to t + does not need to be a day and could instead be a week or a month, depending on the context of the application. To obtain the full marginal predictive, we sample the recruitments conditional on parameters sampled from the posterior. For as yet unopened centers, we set n (·) c = τ c = 0. For each triplet (or couplet, if κ = 0) of parameters sampled from the posterior, we sample N + c , c = 1, . . . , C, and sum them to obtain a sample from N + |α, φ, θ . The collection of these sums is a sample from the posterior predictive distribution for the model.
If simulations for multiple distinct time periods are required for a given center, c, as needed for the accrual curve for example, then we first sample λ o c from its posterior (5.9). We then simulate the Poisson counts for the individual time periods, which are conditionally independent given λ o c , from (5.10).

Model averaging
When predicting the enrollments using a fitted model, we implicitly assume that a single model best reflects reality; however, prediction methods should consider the uncertainty in the models used for inference. We shall, therefore, use model averaging for making predictions, that is, take a weighted average of predictions made by each model. Working in the Bayesian paradigm provides us with an intuitive choice for weights in the form of marginal likelihoods of the models.
The averaging framework fits in with the restriction of the shape parameter κ to a discrete space. Each κ value generates an inhomogeneous Poisson-gamma model with the tail behavior of the associated intensity shape. This includes the null (κ = 0) model as in Anisimov and Fedorov (2007). In this work, we set all prior model probabilities equal.

Model validation
Before making any statements in regards to the future recruitments, we should validate that the fitted model does indeed capture the true data-generating process sufficiently well. Since the true process is unknown, we compare the observed data to the modal model (the model with the highest posterior probability) fixed at posterior parameter means (α,φ,θ). Firstly, we wish to assess that the chosen hierarchical structure is reflected in the data. The distribution of posterior means of the individual random effects should approximately follow the hierarchical Gamma(α,α/φ) distribution. A QQ-plot can be used to visually compare the distributions. If deemed sufficiently similar, using the distribution for generating predictions for yet-unopened centers is appropriate. If the distributions are noticeably different, particularly if the true distribution is multimodal, any interim predictions for yet-unopened centers could (but need not; see robustness study in Section 6) be inaccurate. According to the model, the counts in any initial period [0, t ] (such as the first month) of each center's recruitment period, follow a negative binomial distribution with shape parameter α and success probability φG(t ; θ )/(α + φG(t ; θ)), similar to that given in (5.11) but using α and φ in place of α * c and φ * c . As the true parameters are unknown, we compare it to the distribution fixed at point-estimates (α,φ,θ). The diagnostic indicates if the combination of the gamma random effects and the modal decay model captures the behavior over the initial period after center initiation. Again, a QQ-plot can be used for comparing the theoretical distribution to the observation, giving an indication if the fitted model under-or overestimates initial recruitment. The initial period, [0, t ], should be long enough that the true recruitment decay should be apparent. However, since only centers that have been recruiting for a period of at least t can be used for the diagnostic, to ensure a reasonable power, t should be short enough that a large number of sites have been recruiting for this duration. In this work, we set t = 60 (2 months).

SIMULATION RESULTS
We demonstrate our flexible framework through a simulation study, using simulated data sets to illustrate model fit and prediction and to highlight the effect model misspecification can have on predictions. In practice, patterns in center initiation times can vary greatly between trials. For presenting the methodology, we consider an initiation schedule similar to that observed in a typical trial. We test the robustness of the method using a uniform initiation schedule, with another type of schedule examined in Appendix G of the Supplementary material available at Biostatistics online.
Our historical data set do not include the initiation times of the centers, so instead, to accurately reflect the historical data used in the meta-analysis and what is often available to researchers, we take the first recruitment time of a center as its initiation time and adjust the models to include a single deterministic recruitment at the initiation time of each center followed by stochastic recruitment as described in Section 4.
We simulate a study over a course of 600 days, with 200 centers. The parameters used for simulations were α = 1.4, φ = 0.01, κ = 2.7, and θ = 0.02. The inference is carried out on data observed in the first 360 days. As motivated in Section 1, we condition the inference on a set of known initiation times, chosen by the practitioner; these could subsequently be varied to investigate the impact of different schedules or initiation models. We consider a set of models with flexible tails (Section 4.1) allowing κ ∈ {0, 0.5, 1, 2, ∞}, thus including the null model (Anisimov and Fedorov, 2007). The "normalization" of the curve-shapes was imposed atτ = 1 C C c=1 τ c . We purposely simulated using a κ value outside of those considered in our models to illustrate the flexibility of the framework. For Bayesian inference, we used parameter and model priors outlined in Sections 5.1 and 5.3, respectively. Based on the model fitted to the data at the census day 360, we wish to predict the daily accrual until day 600.
Performing the LRT and BST from Section 3, we find the p-values of both tests to be < 0.001. Table 3 provides the fits for the five models. The effective samples sizes are high, which means that each of the model posteriors is represented well by its respective sample and that the marginal likelihood estimates are accurate. If the ESS values had been low, we would have retried using more samples in the importance sampler. We see that model corresponding to κ = ∞ has the highest posterior probability. A trellis plot of the posteriors for (α,φ,θ) from the modal model (see Appendix G of the Supplementary material available at Biostatistics online) confirms at least approximate pairwise orthogonality between the parameters, as  Fig. 3. Simulated accrual data from Section 6 with Bayesian model-averaged forecast predictive mean of the accrual (solid, red) and 95% prediction bands (red, dashed). Prediction bands are based on the 2.5% and 97.5% quantiles. The forecast begins from a point marked by the red dot and the "+" symbols on the abscissa indicate center initiation times.
anticipated from Sections 4 and 5.1. QQ-plots for the modal model comparing the hierarchical gamma distribution to the posterior means of the random effects and comparing the observed recruitments over the first two months of each center's recruiting period to the model's negative binomial distribution both show approximate straight lines with unit gradient and are provided in the Supplementary material available at Biostatistics online. Figure 3 shows the accrual forecast from the census time τ = 360 up to the horizon τ H = 600, superimposed onto the true accrual plot. The forecast is based on the Bayesian model-averaged posterior predictive distribution. The true accrual is contained within the 95% predictive intervals.
Figures 4 and 5 use an earlier census time (τ = 240) to illustrate the issues that can arise when making predictions using maximum likelihood estimation and model selection. The inference was carried out with the same set of candidate models, and predictions were obtained by simulating from the best model (κ = ∞, chosen using AIC) with parameters fixed at the MLEs. As shown in the plots, not accounting for parameter and model uncertainty may lead to overly confident and biased predictions. Simulations with τ = 360 (see Supplementary material available at Biostatistics online) still showed bias due to the choice of a single model, although the contrast with Figure 3 in terms of prediction interval width was less marked.
We repeated the analysis with a different distribution of initiation times, making the center initiations "clump" roughly every 2 months. The resulting forecast predictive distribution can be seen in Figure 6; performance appears to be robust to the type of initiation schedule.
To further test the robustness of the framework, we first consider the random effects λ o c now being generated from a mixture of two gamma distributions We considered data generated using the same α value and curve-shape as before, but now with center initiation times uniformly sampled on the interval. The ratio of gamma expectations was fixed such that φ 2 = 10φ 1 , and the random effect expectation, E[λ o c ] = (φ 1 + φ 2 )/2, was set to 0.01 and then 0.03.  Fig. 6. Simulated accrual data from Section 6 using clumped initiations with forecast predictive accrual mean (solid, red) and 95% prediction bands (red, dashed). Prediction bands are based on the 2.5% and 97.5% quantiles. The forecast begins from a point marked by the red dot and the "+" symbols on the abscissa indicate center initiation times. data, that is, the larger E[λ o c ], the more apparent the discrepancy in the random-effect distribution, and the concomitant predictions, becomes. This is visible in the clearly non-linear diagnostic QQ-plots, and the plotted forecasts (see Supplementary material available at Biostatistics online). The robustness of predictions comes from the fact that the random effects for initiated centers use re-estimated data-driven distributions, reducing the importance of the random-effect prior; thus the main source of forecasting error comes from the incorrect random-effect prior for new centers. Similar plots for the "clumped" initiation schedule, provided in the Supplementary material available at Biostatistics online, show the same pattern. This mixture distribution of random effects represents the (extreme) scenario where roughly half of the centers recruit the vast majority of patients, with the remaining sites recruiting little to none each. When the ratio of the two means is closer to 1, the model still produces reliable predictions.
We also consider the effect of curve-shape misspecification on predictions, generating data using an intensity proportional to the Weibull density function where θ, k > 0. We simulated accrual datasets using the Weibull shape with θ = 30 and k = 1.5, resulting in the highest recruitment rates occurring two weeks after center initiation. The random-effect distribution used α = 1.4 and two different values φ were used: 0.01 and 0.03; Figures 9 and 10 show example forecasts. For lower overall recruitment levels, the model still predicts future accrual well. Forecast inaccuracies due to model misspecifiation become more apparent when larger recruitment rates are used. The same pattern is observed when center initiation times are clumped (see Appendix G of the Supplementary material available at Biostatistics online). diagnostic QQ-plots for the model fitted to data available at time 0.4. They indicate that there is sufficient concordance between the assumed model and observed enrollment giving validity to potential predictions. Figure 13 shows the accrual along with forecasts from four different census times. The predictive bands become narrower and parameter uncertainty decreases at each census as more data become available for inference. After the third census, there is an unexpected jump in accrual followed by a drop around the fourth census time, suggesting a global external factor, such as a change in the protocol. Table 4 shows p-values of the LRT and BST. Initially, when the accrual is still only a small proportion of the total, it is hard to detect the time-inhomogeneity. At later census points, the test outcomes indicate that the rates are not constant. We compare the proposed framework to the standard homogeneous PG model (2.1) as well as a homogeneous Poisson process (HPP) model fitted only to the accrual. We used the same priors as outlined  Fig. 13. Data example in Section 7 with the accrual (black, solid) for an oncology study; colored solid lines are mean predictions from census times, dashed lines are the 95% prediction bands, and the "+" symbols indicate initiation times of centers.   in Section 5.1 for fitting the PG model, and the HPP rate estimate was obtained using maximum likelihood. The methods were compared in terms of the predicted completion time of the recruitment for the study with the sampling details outlined in Appendix F of the Supplementary material available at Biostatistics online. Forecast completion time from 6 different census points and can be seen in Figure 14; the first HPP predictions were centered at 3.67 and 1.84 which were outside the plot's range. The proposed framework produces better point predictions, especially at earlier interim analyses, and more closely represents the true uncertainty. The HPP predictions near the end of the trial are very accurate. At this point, the majority of the centers having already been initiated and have been recruiting for a long period of time. As a result, the total recruitment rates are not changing by much, with the slight decreasing trend offset by the occasional initiation of a new center. This is a coincidence; if the decay rate had been sharper or shallower, or if fewer or more centers had been initiated then the naive overall Poisson process model would not have fitted as well. The underprediction of the completion time by the proposed model at the census time of t = 0.71 is likely a result of the unexpected surge in recruitment at around that time. The surge is examined in more detail in Appendix G of the Supplementary material available at Biostatistics online.

DISCUSSION
We have introduced a general, flexible framework for modeling and predicting recruitment to clinical trials. We suggest two tests for detecting decay in recruitment rates; comparing them both with respect to power and robustness. The particular form of the test statistic allows for a single, simple trial-level test. Alternative forms, such as splitting according to a global time, would either require a test for each center, massively reducing the power, or estimates of all of the individual center intensities which would introduces several layers of additional complexity because of the hierarchical connection between the center intensities. If it were believed a priori that a particular global period would be unrepresentative then this time span, and the concomitant recruitment, could simply be removed, albeit at the cost of lower power. The parametric curve-shape forms chosen for the intensity were based on the features encountered in oncology trials. We found that the model was still robust to moderate model misspecifications in the distribution of the random effect and intensity shape. Other therapeutic areas such as pulmonary or cardiovascular diseases experience more frequent recruitments and different curve-shapes may be appropriate.As shown in Section 6, model misspecification becomes more of a problem at larger enrollment rates. However, with increased frequency, pattern changes in the early months of a center are easier to identify. Using more complex parametric forms, such as Weibull or generalized gamma shape, could lead to more accurate predictions. Alternatively, if covariate information is available, say x c for each center, the following intensity form motivated by hazard models from survival analysis could be used: λ c (t) = λ o c exp{β x c }g t; exp{η x c } , where λ o c are now random effects coming from a Gamma(α, α) distribution and β and η are vectors of unknown parameters.
As seen in the data example in Section 7, there can be external factors modulating the overall accrual. This could potentially be modeled via a short-term, constant global intensity modifier, which would maintain tractability. The framework is not constrained to parametric forms; non-parametric intensity models, such as those using B-splines (e.g., Morgan and others, 2019) or Gaussian processes (e.g., Adams and others, 2009), could be used instead. This, however, would make the intensity extrapolation problem more difficult.
For curve-shape parameter prior construction, our choice of the quantity of interest R κ was motivated by simplicity of the form; one could just as well have used Gκ (t 0 /2;θ) Gκ (t 0 ;θ) , albeit with more algebraic manipulations. The general method was aimed at models with monotonically decreasing intensities. If curve-shapes such as Weibull are considered then constructing sensible priors will be more complicated.
In presenting the method, we condition the inference and prediction on known initiation schedules for the centers. Incorporating stochastic center initiation models, such as those in Anisimov (2009) and Lan and others (2018), into the Monte Carlo prediction framework is straight-forward, but would complicate the presentation of our methodology without adding novelty. In Appendix H of the Supplementary material available at Biostatistics online, we demonstrate how recruitment can be predicted using our methodology when there is uncertainty in the initiation schedule. For illustration, we imagine a Weibull-distributed delay to each center's initiation, but any other initiation model could be incorporated in a similar manner. We stress that full prediction intervals should take this uncertainty into account.
In this work, we focus on patient recruitment regardless of the numbers of dropouts observed. In practice, screening failure and patient withdrawal are both prevalent in clinical trials. Assuming the dropouts are independent of the recruitment process, existing survival analysis techniques such as Cox's proportional hazard model (Cox, 1972) or accelerated failure time frailty model (Wei, 1992) could be used in combination with the recruitment model to produce distributions of the numbers of patients in the system at a given time. Such knowledge would be useful to the practitioners and operational researchers in charge of drug-supply chains for the centers. Anisimov and Fedorov (2007) introduced a method for determining the number of additional centers needed to be initiated for the study to finish on time. With minimal adaptation, the same method can also be used with our model. However, since it assumes that all new centers are initiated immediately, it may not apply in all scenarios. We would advocate a simulation-based approach, where forecasts based on different center initiation schedules are compared. As different operational costs can be associated with different schedules, this would become a resource-constrained optimization problem.

SOFTWARE
Software in the form of R code is available at https://github.com/SzymonUrbas/ct-recuitment-prediction.

SUPPLEMENTARY MATERIAL
Supplementary material is available online at http://biostatistics.oxfordjournals.org.

ACKNOWLEDGMENTS
Conflict of Interest: None declared.

FUNDING
The authors are grateful to UK Research and Innovation (EP/L015692/1) and AstraZeneca for sponsoring this research.