Bayesian sample size determination in basket trials borrowing information between subsets

Basket trials are increasingly used for the simultaneous evaluation of a new treatment in various patient subgroups under one overarching protocol. We propose a Bayesian approach to sample size determination in basket trials that permits borrowing of information between commensurate subsets. Specifically, we consider a randomised basket trial design in which patients are randomly assigned to the new treatment or a control within each trial subset ('subtrial' for short). Closed-form sample size formulae are derived to ensure that each subtrial has a specified chance of correctly deciding whether the new treatment is superior to, or not better than, the control by some clinically relevant difference. Given pre-specified levels of pairwise (in)commensurability, the subtrial sample sizes are solved simultaneously. The proposed Bayesian approach resembles the frequentist formulation of the problem in yielding comparable sample sizes when no borrowing is permitted. When borrowing is enabled between commensurate subtrials, a considerably smaller trial sample size is required than under the widely implemented approach of no borrowing. We illustrate the use of our sample size formulae with two examples based on real basket trials. A comprehensive simulation study further shows that the proposed methodology can maintain the true positive and false positive rates at desired levels.


Introduction
Clinical research in precision medicine (Mirnezami et al., 2012; Schork, 2015) continues to thrive as a consequence of the rapid technological advances for identifying possible prognostic and predictive disease factors at the genetic level (Aronson and Rehm, 2015; Morganti et al., 2019). Because of this, an increasing number of biomarker-driven therapies have been formulated. In oncology, for example, much attention has been paid to therapies targeting one or multiple genomic aberrations (Kim et al., 2011; Hyman et al., 2018; Redman et al., 2020). In contrast to conventional chemotherapy devised for treating histology-defined populations, such targeted therapies can potentially be beneficial to patients of various cancer (sub)types. Immune-mediated inflammatory diseases (IMIDs) (McInnes and Gravallese, 2021) are another area where targeted therapies can be useful (Grayling et al., 2021). IMIDs generally involve a clinically diverse group of conditions that share common underlying pathogenetic features, calling for the development of effective immune-targeted therapeutics (Pitzalis et al., 2020). This paradigm shift towards precision medicine has challenged the use of traditional one-size-fits-all approaches to trial design, which aim to estimate the population average treatment effect.
Master protocols (Woodcock and LaVange, 2017) comprise a class of innovative trial designs that address multiple investigational hypotheses. Newly emerging types include basket trials, which can simultaneously evaluate a new treatment in stratified patient subgroups displaying a common disease trait (Renfro and Sargent, 2017; Tao et al., 2018). An implication of the stratification is that patients may respond very differently to the same treatment due to their distinct disease subtypes, stages or status. Fully acknowledging the heterogeneity could lead to the use of stand-alone analyses that regard the stratified subgroups in isolation. Such an analysis strategy, though adopted in early basket trials, may not be ideal for realising the promise of basket trials. This is mainly because it (i) fails to treat the combined (sub)trial components as a single study, and (ii) often yields low-powered tests of the treatment effect, due to the small sample sizes. Sophisticated analysis models, which feature borrowing of information between subgroups (ideally between those with commensurate treatment effects only), have been proposed in the statistical literature. One pivotal strategy is to fit a Bayesian hierarchical random-effects model (Thall et al., 2003; Berry et al., 2013), assuming that the subgroup-specific treatment effects are exchangeable, i.e., random samples drawn from a common normal distribution with unknown mean and variance. This methodology has been extended to involve (i) a finite mixture of exchangeability distributions, see, e.g., Liu et al. (2017); Chu and Yuan (2018); Jin et al. (2020); as well as (ii) non-exchangeability distributions, such that subgroups with an extreme treatment effect can be modelled on their own (Neuenschwander et al., 2016). The former reflects the concern that some trial subsets may be more commensurate with one another than with the rest.
A highly relevant proposal is to further cluster the subgroups, so that the corresponding treatment effects are assumed to be exchangeable within the same cluster. Chen and Lee (2020) present a two-step procedure, by which subgroups are clustered using a Bayesian nonparametric model, before fitting an adjusted Bayesian hierarchical random-effects model.
A few authors have further recommended assessing the exchangeability or commensurability of any two or more subgroups. These proposals allow better characterisation of the complex trial data structure, which could involve mixtures of exchangeable and non-exchangeable patient subgroups. Psioda et al. (2021) apply a Bayesian model averaging technique (Hoeting et al., 1999) to accommodate the possibility that any configuration of subgroups may have the same or disparate response rates. Hobbs and Landin (2018) construct a matrix containing elements with values of 0 or 1, indicating that any pair of subgroups can be exchangeable or non-exchangeable. Alternatively, the pairwise (in)commensurability can be measured by distributional discrepancy to enable an appropriate degree of borrowing from each complementary subgroup, with the largest weight allocated to the most commensurate one(s).
Development of methods for choosing an appropriate sample size for basket trials, however, appears to lag behind. A widely implemented approach is to sum up the sample sizes, calculated as if the trial subsets were to be carried out as separate studies. Whilst this could impair the efficiency of decision making, alternative approaches to sample size determination that permit borrowing of information are lacking. In this paper, we propose formal sample size planning for the design of basket trials. It strikes a balance between the sample size savings and the need to enrol a sufficient number of patients to support inferences about the subgroup-specific treatment effects. As the importance of randomised controlled trials has been increasingly emphasised in oncology (Ratain and Sargent, 2009; Grayling et al., 2019), IMIDs (Grayling et al., 2021) and rare-disease (Prasad and Oseran, 2015) research, this paper will focus on randomised basket trial designs with the primary objective of simultaneously comparing the new treatment against a control in various patient subgroups. We will thus develop our sample size formulae presuming that the analysis is performed using an adapted commensurate prior model.
The remainder of this paper is organised as follows. In Section 2, we introduce a Bayesian model that estimates the subgroup-specific treatment effects using the entire trial data, and derive sample size formulae appropriate for basket trials. Two data examples are presented in Section 3 to illustrate the use of our formulae for the design of randomised basket trials. In Section 4, we describe a simulation study that evaluates the operating characteristics of randomised basket trials. Finally, we conclude with a brief discussion and highlight several areas that deserve future research in Section 5.

Leveraging complementary subtrial data into commensurate priors
Let us consider the design of a basket trial where patients can be classified into K subgroups. These patients nonetheless share a common feature (e.g., a genetic aberration, clinical symptom or mechanism of drug action), on which a new targeted therapy may potentially improve patient outcomes. Each component study in a distinct patient subgroup will hereafter be referred to as a trial subset (i.e., 'subtrial' for short). Within each subtrial k, patients are randomised to receive either the experimental treatment (labelled E) with probability R k ∈ (0, 1), or a control (labelled C) with probability (1 − R k ), for k = 1, . . . , K. We further assume that the measured responses are normally distributed with their own subtrial-specific parameters: X ijk ∼ N (µ jk , σ 2 k ), for j = E, C; k = 1, . . . , K. Letting n k denote the subtrial sample size, the difference in sample means, X̄ Ek − X̄ Ck , summarises the observed treatment difference per subtrial k. For ease of notation, we let θ k = µ Ek − µ Ck denote the treatment effect for subtrial k. It is important to clarify at the outset that this design aims to estimate the subtrial-specific treatment effects, i.e., θ 1 , . . . , θ K , instead of an overall treatment effect averaged over all subtrials. If borrowing of information across subtrials is permitted, these treatment effects are estimated using the entire trial data (with n 1 + · · · + n K patients in total) rather than in isolation (with n 1 , . . . , n K patients, respectively).
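As a concrete illustration of this sampling model, the sketch below simulates one subtrial and confirms that the difference in sample means estimates θ k = µ Ek − µ Ck. All numeric inputs (n k = 200, R k = 0.5, µ E = 1, µ C = 0, σ k = 2) are arbitrary illustrative choices, not values from the paper.

```python
import random
import statistics

def simulate_subtrial(n_k, R_k, mu_E, mu_C, sigma_k, rng):
    """Simulate one subtrial: each of n_k patients is randomised to the
    experimental arm E with probability R_k, otherwise to control C."""
    x_E, x_C = [], []
    for _ in range(n_k):
        if rng.random() < R_k:
            x_E.append(rng.gauss(mu_E, sigma_k))
        else:
            x_C.append(rng.gauss(mu_C, sigma_k))
    # The difference in sample means estimates theta_k = mu_E - mu_C
    return statistics.mean(x_E) - statistics.mean(x_C)

rng = random.Random(1)
est = statistics.mean(
    simulate_subtrial(200, 0.5, mu_E=1.0, mu_C=0.0, sigma_k=2.0, rng=rng)
    for _ in range(2000)
)
# Averaged over many replicates, est should be close to theta_k = 1.0
```

Averaging the estimator over many replicates recovers the true effect, consistent with its unbiasedness under this model.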
We specify commensurate priors for each θ k using information from the (K − 1) complementary subtrials indexed by q ≠ k, for all k = 1, . . . , K. This methodology regards any θ q as a biased representation of θ k , yet the direction and the size of such bias are unknown (Hobbs et al., 2011). More specifically, these commensurate priors are formulated as conditional normal distributions that are centred at the θ q s, respectively, whilst the precisions (i.e., reciprocals of variances), denoted by ν qk , accommodate the heterogeneity between two subtrials k and q. Our commensurate prior model for the continuous location parameter θ k can thus be given by

θ k | θ q , ν qk ∼ N (θ q , ν qk −1 ), with ν qk ∼ w qk Gamma(a 1 , b 1 ) + (1 − w qk ) Gamma(a 2 , b 2 ), (1)

where a two-component Gamma mixture prior (with a 1 /b 1 and a 2 /b 2 being the respective means of the component distributions), instead of a spike-and-slab prior in the original proposal, is placed on each ν qk for the convenience of analytic tractability. In particular, the two Gamma mixture components correspond to extreme cases of substantial or limited discounting of information from a complementary subtrial q. For illustration, we suppose that the first Gamma mixture component places its density massively on small values, and the second component on large values. The prior mixture weight w qk ∈ [0, 1], which balances between these extreme cases, can thus reflect one's preliminary scepticism about the degree of commensurability between θ k and θ q . That is, when subtrials k and q are thought of as incommensurate (commensurate), w qk can be set close to 1 (0), thus forcing the conditional prior variance ν qk −1 towards large (small) values for substantial (limited) discounting.
By integrating out ν qk , the conditional prior for θ k given θ q follows a shifted and scaled t mixture distribution, with its two components both centred at θ q . This unimodal t mixture distribution can further be approximated by matching the first two moments, giving

θ k | θ q ∼ N (θ q , w qk b 1 /(a 1 − 1) + (1 − w qk ) b 2 /(a 2 − 1)), approximately, (2)

which incorporates the respective variances of the t component distributions. This normal approximation has been shown to provide good properties for the coverage of credible intervals of interest. Note that the location of each commensurate prior, θ q , is an unknown parameter. It captures the information from a complementary subtrial q, of which the required sample size n q as well as the allocation proportion R q are yet to be determined.
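The moment-matching step can be checked numerically: under a Gamma(a, b) prior on the precision, the marginal t component centred at θ q has variance b/(a − 1) (valid for a > 1), so the matched normal variance is the w qk -weighted sum of the two component variances. The hyperparameter values below are illustrative assumptions only, with component 1 placing its mass on small precisions.

```python
import random

def matched_normal_variance(w, a1, b1, a2, b2):
    """Variance of the moment-matched normal: each t component (centred at
    theta_q) has marginal variance b/(a - 1), valid for a > 1."""
    return w * b1 / (a1 - 1) + (1 - w) * b2 / (a2 - 1)

def sample_theta_k(theta_q, w, a1, b1, a2, b2, rng):
    """One draw from the commensurate prior: pick a Gamma component for the
    precision nu, then draw theta_k | nu ~ N(theta_q, 1/nu)."""
    a, b = (a1, b1) if rng.random() < w else (a2, b2)
    nu = rng.gammavariate(a, 1.0 / b)      # shape a, rate b (mean a/b)
    return rng.gauss(theta_q, (1.0 / nu) ** 0.5)

rng = random.Random(7)
w, a1, b1, a2, b2 = 0.3, 3.0, 30.0, 3.0, 0.3   # component 1: small precision
draws = [sample_theta_k(0.0, w, a1, b1, a2, b2, rng) for _ in range(200_000)]
mc_var = sum(d * d for d in draws) / len(draws)
# mc_var should approximate matched_normal_variance(w, a1, b1, a2, b2)
```

The Monte Carlo variance of the exact t mixture agrees with the closed-form matched variance, supporting the approximation used above.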
Let x q = {x 1Eq , . . . , x n q Eq ; x 1Cq , . . . , x n q Cq } denote the data of a complementary subtrial q. We consider the difference of sample means, X̄ qE − X̄ qC , as the random variable on which to draw the Bayesian inference. With an 'uninformative' operational prior θ q ∼ N (m 0q , s 2 0q ), we derive the posterior as

θ q | x q ∼ N (λ q , s 2 q ), with 1/s 2 q = 1/s 2 0q + n q R q (1 − R q )/σ 2 q and λ q = s 2 q {m 0q /s 2 0q + (x̄ Eq − x̄ Cq ) n q R q (1 − R q )/σ 2 q }, (3)

where x̄ jq denotes the average response of samples by treatment group j = E, C, within subtrial q.
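This is the standard normal-normal conjugate update, treating the observed mean difference as a single observation of θ q with sampling variance σ 2 q /{n q R q (1 − R q )}. A minimal sketch (all numeric inputs are illustrative):

```python
def posterior_theta_q(diff, n_q, R_q, sigma2_q, m0, s02):
    """Normal-normal update for theta_q given the observed mean difference
    diff = xbar_Eq - xbar_Cq, whose sampling variance is
    sigma2_q / (n_q * R_q * (1 - R_q)) when both arms share sigma2_q."""
    v = sigma2_q / (n_q * R_q * (1 - R_q))   # likelihood variance of diff
    prec = 1.0 / s02 + 1.0 / v               # posterior precision
    post_mean = (m0 / s02 + diff / v) / prec  # precision-weighted average
    return post_mean, 1.0 / prec

# With a vague prior, the posterior mean is pulled towards the data
mean, var = posterior_theta_q(diff=1.5, n_q=40, R_q=0.5, sigma2_q=6.0,
                              m0=0.0, s02=100.0)
```

As s 2 0q grows, the posterior mean converges to the observed difference, the behaviour expected of an 'uninformative' operational prior.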
Combining (2) and (3), we obtain a commensurate prior for θ k , denoted N (λ q , ξ 2 qk ) in (4), that leverages the data of a complementary subtrial q ≠ k. Consider now borrowing information from all complementary subtrials, with K ≥ 3. Let x (−k) denote the data from all subtrials excluding k, that is, all the (K − 1) sets of complementary data for subtrial k. By the convolution operator (Grinstead and Snell, 1997), we stipulate a collective commensurate prior for θ k , given in (5), that synthesises the pairwise commensurate priors specified using the x q , where the p qk are the synthesis weights with Σ q≠k p qk = 1. These synthesis weights can be transformed from the chosen values of w qk in the commensurate prior models, following an objective-directed approach. More specifically, we expect the largest synthesis weight, p qk , to be assigned to the most commensurate prior N (λ q , ξ 2 qk ), specified based on the subtrial q ≠ k that manifests the smallest discrepancy with subtrial k out of all the (K − 1) complementary subtrials. Recall that each w qk , as one key parameter determining N (λ q , ξ 2 qk ), would have been chosen to appropriately reflect the pairwise discrepancy (i.e., incommensurability). One may apply a decreasing function of w qk to compute p qk . A K × K matrix can be constructed to contain all w qk , with w qk in column k and row q. We note that this matrix should be symmetric with w qk = w kq , since each element is intended to reflect the level of pairwise incommensurability. That is, the magnitude of incommensurability between subtrials k and q is the same as that between q and k. If stratifying the matrix by column, the off-diagonal elements in column k represent the postulated levels of discounting with respect to the complementary subtrial data.
Recall that these off-diagonal elements have been used to specify the respective commensurate priors in the form of (4), with q ≠ k. A decreasing function of w qk , governed by a concentration parameter c 0 , has been illustrated to have satisfactory properties. If c 0 is set equal to a value close to 0 + , it appropriately discerns the (K − 1) values of w qk ; thus, a p qk → 1 would be assigned to the commensurate prior for θ k based on the x q for which the smallest w qk has been used. Otherwise, a value of c 0 much larger than the w qk yields nearly all p qk equal to 1/(K − 1), irrespective of the values of w qk . Moreover, this transformation yields equal p qk when all w qk are equal. We generally recommend setting c 0 to a value that is substantially smaller than the magnitude of the w qk ; the performance under varying c 0 has been evaluated thoroughly elsewhere.
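The paper's exact transformation is not reproduced here, but one decreasing function with all of the stated properties is an exponential (softmax-type) weighting, p qk ∝ exp(−w qk /c 0 ): as c 0 → 0 + the weight concentrates on the smallest w qk , whilst a c 0 much larger than the w qk returns near-uniform weights, and equal w qk give equal p qk . A sketch under this assumed form:

```python
import math

def synthesis_weights(w, c0):
    """Map pairwise discounting parameters w_qk (q != k) to synthesis
    weights p_qk that sum to 1 and decrease in w_qk. Assumed form:
    p_qk proportional to exp(-w_qk / c0). Small c0 concentrates the
    weight on the most commensurate complementary subtrial."""
    raw = [math.exp(-wq / c0) for wq in w]
    total = sum(raw)
    return [r / total for r in raw]

w = [0.2, 0.5, 0.9]          # discounting vs. three complementary subtrials
p_sharp = synthesis_weights(w, c0=0.05)   # concentrates on w = 0.2
p_flat = synthesis_weights(w, c0=50.0)    # nearly uniform, 1/(K - 1) each
```

With c 0 = 0.05, virtually all weight goes to the subtrial with w qk = 0.2; with a very large c 0 , the weights are practically uniform.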
By Bayes' theorem, the collective commensurate prior in the form of (5) is updated by the contemporary subtrial data x k to give the posterior distribution for θ k in (6), which is normal with mean d θ k and variance σ 2 θ k . The posterior mean is a convex combination of the prior means λ q (q ≠ k), weighted by the p qk , and the data likelihood. We will give the exact expression of d θ k in Section 4 to carry out the decision making for simulated trials.

Sample size formulae for basket trials comparing two normal means
The frequentist approach to sample size determination makes use of hypothesis testing, with H 0k : θ k ≤ 0 against H 1k : θ k > 0, assuming that greater values of X ijk indicate a better effect. In this traditional framework, a study sample size is computed such that the treatment effect, θ k , will be found significant at level α with probability 1 − β, given a certain magnitude of the treatment effect considered clinically meaningful.
We follow the Bayesian decision-making framework presented by Whitehead et al. (2008) to compute two interval probabilities from our posterior distribution as derived in (6), so that the subtrial sample sizes, n 1 , . . . , n K , are sought to provide compelling evidence of E being either superior to, or not better than, C by some magnitude δ in each subtrial k = 1, . . . , K. The posterior distribution of θ k specific to each subtrial k = 1, . . . , K will thus be evaluated to declare that E is superior to C (criterion (7)) or not better than C by δ (criterion (8)), where η and ζ are probability thresholds for the success and futility criteria, respectively. By using this decision rule, the two posterior tail probabilities of θ k are controlled. Specifically, we desire the posterior mass to the left of 0 to be limited below 1 − η, with the mass to the right of δ below 1 − ζ. The sample size therefore needs to be sufficiently large for a decisive declaration of the treatment's effectiveness or futility per subtrial k. That is, d θ k /σ θ k ≥ z η or (δ − d θ k )/σ θ k ≥ z ζ should be guaranteed. Here, z η satisfies Φ(z η ) = η, where Φ(·) denotes the standard normal distribution function, with z ζ defined similarly. We thus require δ/σ θ k ≥ z ζ + z η , which leads to

1/σ 2 θ k ≥ {(z ζ + z η )/δ} 2 . (9)

The left-hand side of (9) is precisely the posterior precision for θ k .
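To see why the precision floor {(z ζ + z η )/δ} 2 guarantees a decisive outcome, note that once δ/σ θ k ≥ z ζ + z η , whenever the success rule fails the futility rule must hold, and vice versa. A numerical check, using δ = 2.3, η = 0.95 and ζ = 0.80 as illustrative values:

```python
from statistics import NormalDist

def required_posterior_precision(delta, eta, zeta):
    """Minimum posterior precision 1/sigma_theta^2 such that, for any
    posterior mean d, either the success rule (d/sigma >= z_eta) or the
    futility rule ((delta - d)/sigma >= z_zeta) must hold."""
    z = NormalDist().inv_cdf(eta) + NormalDist().inv_cdf(zeta)
    return (z / delta) ** 2

def decision(d, sigma_theta, delta, eta, zeta):
    """Apply the two-sided decision rule to a posterior N(d, sigma^2)."""
    if d / sigma_theta >= NormalDist().inv_cdf(eta):
        return "superior"
    if (delta - d) / sigma_theta >= NormalDist().inv_cdf(zeta):
        return "not better by delta"
    return "inconclusive"

# At the precision floor, every posterior mean yields a decisive outcome
prec = required_posterior_precision(delta=2.3, eta=0.95, zeta=0.80)
sigma = prec ** -0.5
outcomes = {decision(d / 10.0, sigma, 2.3, 0.95, 0.80) for d in range(-30, 60)}
```

Sweeping the posterior mean over a wide grid never produces an inconclusive outcome when the floor is met, which is exactly the property the sample size formulae exploit.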
Under no borrowing, the posterior precision for θ k is 1/s 2 0k + n k R k (1 − R k )/σ 2 k , so that (9) yields

n 0 k = {(z ζ + z η ) 2 /δ 2 − 1/s 2 0k } σ 2 k /{R k (1 − R k )}. (10)

By contrast, based on the proposed Bayesian model for borrowing of information, σ 2 θ k comes from the closed-form expression in (6). This leads to the analogous requirement, labelled (11), which looks similar to (10) but involves the commensurate prior variances ξ 2 qk in the form of (4). The latter leverage the complementary subtrial information. Thus, a smaller integer for n k could be expected if the complementary subtrials, labelled q ≠ k, are to collect rich information and, further, considerable borrowing of information happens. To ensure the inference in all K subtrials, we require that the precision constraint (11) holds for all k = 1, . . . , K, with p qk transformed from w qk following the stipulation in Section 2.1. The K nonlinear equations in n 1 , . . . , n K are continuously differentiable. We apply Newton's method for systems of nonlinear equations (Dennis and Schnabel, 1983) to find n 1 , . . . , n K that satisfy the K constraints simultaneously, given known w qk and σ 2 k .
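The simultaneous solve can be sketched on a toy two-subtrial system. The coupled equations below are an assumed simplification, not the paper's expression (11): each subtrial's posterior precision is taken as its own data precision plus precision borrowed from the complementary subtrial, whose sampling variance is inflated by an incommensurability term τ 2 (all numeric values are illustrative).

```python
from statistics import NormalDist

def residuals(n, sigma2, R, tau2, target):
    """Toy coupled constraints: own data precision n_k R (1 - R) / sigma2_k
    plus borrowed precision from each complementary subtrial, whose
    variance is inflated by the incommensurability term tau2."""
    out = []
    for k in range(len(n)):
        own = n[k] * R * (1 - R) / sigma2[k]
        borrowed = sum(1.0 / (sigma2[q] / (n[q] * R * (1 - R)) + tau2)
                       for q in range(len(n)) if q != k)
        out.append(own + borrowed - target)
    return out

def newton_2d(f, x0, tol=1e-10, h=1e-6, max_iter=100):
    """Newton's method for a two-equation system, with a forward-difference
    Jacobian and a hand-rolled 2x2 linear solve (Cramer's rule)."""
    x = list(x0)
    for _ in range(max_iter):
        r = f(x)
        if max(abs(v) for v in r) < tol:
            break
        J = [[0.0, 0.0], [0.0, 0.0]]
        for j in range(2):
            xp = list(x)
            xp[j] += h
            rp = f(xp)
            for i in range(2):
                J[i][j] = (rp[i] - r[i]) / h
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        x[0] -= (r[0] * J[1][1] - r[1] * J[0][1]) / det
        x[1] -= (J[0][0] * r[1] - J[1][0] * r[0]) / det
    return x

z = NormalDist().inv_cdf(0.95) + NormalDist().inv_cdf(0.80)
target = (z / 2.3) ** 2                  # required posterior precision
sigma2, R, tau2 = [6.0, 6.0], 0.5, 1.0   # illustrative values
n = newton_2d(lambda x: residuals(x, sigma2, R, tau2, target), [30.0, 30.0])
n_noborrow = [s2 * target / (R * (1 - R)) for s2 in sigma2]
# Each solved n_k falls below its no-borrowing counterpart
```

In this toy system, the borrowed precision drives both solved sample sizes well below the no-borrowing solution, mirroring the behaviour reported for the full method.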
The importance of w qk in computing the subtrial sample sizes is of particular interest. Whilst the definition of data (in)commensurability can vary on a case-by-case basis, these values may be better specified in collaboration with subject-matter experts. Such conversations may also help quantify the magnitude of incommensurability, particularly in the absence of pilot data or relevant investigations. We caution that, in practical implementation, the choice of w qk should reflect the level of pairwise (in)commensurability a priori, rather than a desire to obtain a minimal sample size.

A UK-based basket trial for treating patients with chronic diseases
The randomised, placebo-controlled Obeticholic acid for the Amelioration of Cognitive Symptoms trial (known as the 'OACS trial', ISRCTN15223158) aims to assess the efficacy of Obeticholic acid (E), as compared to a placebo (C), for treating cognitive deficits. The OACS trial is split into three subtrials, each focusing on a distinct patient subpopulation defined by the disease stage and clinical area. Namely, OACS-1 for patients with stabilised primary biliary cholangitis (PBC) > 2 years since diagnosis; OACS-2 for patients with new-onset PBC ≤ 2 years; and OACS-3 for patients with Parkinson's disease. The primary outcome is a composite cognitive test score obtained from the CANTAB platform (Goldberg, 2013), which is an extensively used tool in clinical practice. The reduction in the composite CANTAB score from the baseline, after 26 weeks of treatment, will be analysed as a normally distributed primary endpoint. We assume the magnitude of such reduction can be adequately depicted by values ranging from -5 to 5, where a high value suggests improvement of cognitive symptoms in a patient.
The sample sizes have been determined as 40 (20 each on E and C) for OACS-1, 25 (15 on E and 10 on C) for OACS-2, and 25 (15 on E and 10 on C) for OACS-3, assuming that these subtrials are to be analysed on their own. As specified in the trial protocol, these were not originally computed based on hypothesis testing considerations. However, the resulting sample sizes are consistent with σ 2 1 = 6.177, σ 2 2 = 5.134, σ 2 3 = 5.134, ensuring 90% statistical power for OACS-1 and 80% for the other subtrials to detect a difference of δ = 2.3, whilst controlling the type I error rate below 0.05. In the following, we use these quantities to illustrate the application of the proposed methodology, as if the OACS basket trial had been designed using the Bayesian decision-making framework outlined above.
If no borrowing is permitted, n 0 k = 39.8, 24.8, 24.8 according to (10). These are about the same as the actual sample sizes of 40, 25 and 25, respectively. For illustration purposes only, we use the following matrix of w qk ,

w 11 w 12 w 13
w 21 w 22 w 23
w 31 w 32 w 33 ,

and c 0 = 0.05 to compute the synthesis weights p qk , following the objective-directed approach outlined in Section 2.1. The subtrial sample sizes are found to be n k = 33.3, 11.8, 18.2, fixing η = 0.95 and ζ = 0.90 for subtrial 1 or 0.80 for subtrials 2 and 3, and maintaining the same treatment allocation proportions R k = 0.5, 0.6, 0.6. These are considerably smaller than the subtrial sample sizes assuming no borrowing, or the frequentist counterparts.
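The no-borrowing figures n 0 k ≈ 39.8, 24.8, 24.8 can be reproduced from the posterior-precision requirement, provided the vague operational prior contributes precision 1/s 2 0k ; the value s 2 0k = 100 below is our assumption, chosen because it closely matches the reported numbers.

```python
from statistics import NormalDist

def n_no_borrowing(sigma2, delta, eta, zeta, R, s02):
    """Sample size meeting the posterior-precision floor when theta_k is
    estimated from subtrial k alone, with operational prior N(0, s02)."""
    z = NormalDist().inv_cdf(eta) + NormalDist().inv_cdf(zeta)
    target = (z / delta) ** 2               # required posterior precision
    return (target - 1.0 / s02) * sigma2 / (R * (1 - R))

# OACS-1: 90% power analogue (zeta = 0.90); OACS-2: 80% (zeta = 0.80)
n1 = n_no_borrowing(6.177, delta=2.3, eta=0.95, zeta=0.90, R=0.5, s02=100.0)
n2 = n_no_borrowing(5.134, delta=2.3, eta=0.95, zeta=0.80, R=0.6, s02=100.0)
# n1 and n2 land close to the reported 39.8 and 24.8
```

Dropping the 1/s 2 0k term recovers the familiar frequentist formula σ 2 (z α + z β ) 2 /{δ 2 R(1 − R)}, which gives 40.0 and 25.0 here, consistent with the observation that the Bayesian and frequentist solutions are comparable under no borrowing.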

Simultaneous evaluation of a new inhibitor in seven cancer subtypes
The ongoing SUMMIT basket trial (NCT01953926) adopted a single-arm design to evaluate a new pan-HER kinase inhibitor, neratinib, in 141 patients with HER2-mutant or HER3-mutant tumours. A binary outcome (i.e., responder or non-responder, corresponding to a tumour shrinkage of ≥ 30% or below) was used in line with the RECIST criteria (Eisenhauer et al., 2009). The SUMMIT trial additionally reported analyses of secondary outcomes, which include the change in tumour volume on a continuous scale of -100% to 100%. We assume a new randomised basket trial would follow, wherein the change in tumour volume from -100% to 100% is the primary outcome. A negative sign indicates clinical benefit, since it is hoped that the tumour shrinks from the baseline measurement due to the treatment. With δ < 0, the directions of the inequalities in the trial decision criteria (7) and (8) should be reversed accordingly. We narrow the focus to seven of the 21 originally investigated cancer subtypes, for which the mean responses of patients receiving neratinib (E) were approximately µ Ek = −0.489, 0.226, −0.181, 0.293, 0.329, −0.275, −0.136. We further assume that the mean responses on a control treatment embedded in the newly planned basket trial are µ Ck = 0, and that patients within each subtrial have an equal probability of receiving E or C; that is, R k = 0.5 for k = 1, . . . , 7.

Simulation study

Basic setting
Motivated by the SUMMIT trial, we consider the sample size planning of basket trials following the same data structure. That is, the basket trial would enrol n 1 , . . . , n 7 patients into the respective subtrials, with R k = 0.5 in all K = 7 subgroups, under six possible scenarios. Figure 1 visualises the six simulation scenarios, where the location and length of the lines suggest the distributions of X ijk , j = E, C, while a larger bubble corresponds to a larger value of w qk . Here, we have followed the specification of w qk given in Section 3.2, computing the pairwise Hellinger distance to characterise (in)commensurability and obtain w qk for the levels of borrowing/discounting strength. Scenarios 4 and 6 correspond to two special cases of the treatment being consistently effective (alternative hypotheses) and consistently futile (null hypotheses), respectively. Both scenarios feature perfect commensurability; that is, the outcomes X iEk and X iCk have their respective, common distributions across subtrials, so all w qk = 0. Scenario 5 represents a mixed null situation, where θ k = 0 holds for four of the subtrials only. The other scenarios are constructed to reflect various levels of data incommensurability. Exact configurations of these simulation scenarios, i.e., the values of µ Ek along with all µ Ck = 0 as well as the subtrial-specific variances σ 2 k , are listed in Table S1 of the Supplementary Materials.
[ Figure 1 about here.] We retain the prior specification and the probability thresholds unchanged from Section 3. Table 1 thus gives the subtrial sample sizes required to reach a decisive conclusion of E being either superior to or not better than C by δ = −0.4, using the respective sample size formulae. Because no w qk has been set to 1 in any scenario, the sample sizes n k computed from the approach of borrowing are generally smaller than n 0 k from the approach of no borrowing. Scenario 1 was constructed from the illustrative data example in Section 3.2. Since relatively large values have been chosen for w qk , only a slight decrease in sample sizes is observed. Unlike Scenario 1, which displays divergent effects, Scenario 2 features a higher degree of commensurability, with E having an enhanced benefit over C in all subtrials. A smaller trial sample size is then required. Scenario 3 has all the variances σ 2 k = 0.3. Consequently, the n 0 k based on the approach of no borrowing are solved to be equal to 46.4 for all subtrials. By contrast, using the proposed methodology, the sample sizes for subtrials 1, 3, 4 and 6 are smaller, as these are recognised to be more commensurate with one another than with the other three. A similar explanation can be given for Scenario 5: subtrials 2, 4 and 5 have greater sample size savings because the corresponding w qk take smaller values. Scenarios 4 and 6 represent the situations of perfect commensurability. With all w qk = 0, a substantial reduction in the subtrial sample sizes is observed.
[ Table 1 about here.] In the numerical evaluation below, we simulate the outcomes X ijk , j = E, C, from N (µ Ek , σ 2 k ) and N (µ Ck , σ 2 k ) for patients i = 1, . . . , n k , within subtrials k = 1, . . . , 7. For each scenario, 100,000 replicates of the basket trials are simulated to fit:
• the proposed Bayesian model, which yields the posterior distributions for θ k in the form of (6);
• a Bayesian stand-alone analysis model for no borrowing of information. Operational priors, i.e., N (m 0k , s 2 0k ), are placed on each θ k . This leads to the posterior distributions for θ k based on x k alone, which have the same form as (3), with the subscript q replaced by k. We set all m 0k = 0 in the simulations.
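A single replicate of the stand-alone (no borrowing) analysis can be sketched as follows; the parameter values are illustrative rather than those of any SUMMIT-based scenario, and the success/futility rule is the one from Section 2.2.

```python
import random
from statistics import NormalDist, mean

def one_replicate(n, R, mu_E, mu_C, sigma, delta, eta, zeta, s02, rng):
    """Simulate one subtrial and apply the success/futility rule under the
    stand-alone posterior with operational prior N(0, s02)."""
    assign = [rng.random() < R for _ in range(n)]
    x_E = [rng.gauss(mu_E, sigma) for a in assign if a]
    x_C = [rng.gauss(mu_C, sigma) for a in assign if not a]
    diff = mean(x_E) - mean(x_C)
    v = sigma ** 2 * (1 / len(x_E) + 1 / len(x_C))  # variance of diff
    prec = 1 / s02 + 1 / v                          # posterior precision
    d, sd = (diff / v) / prec, prec ** -0.5         # posterior mean and sd
    if d / sd >= NormalDist().inv_cdf(eta):
        return "superior"
    if (delta - d) / sd >= NormalDist().inv_cdf(zeta):
        return "not better"
    return "inconclusive"

rng = random.Random(3)
results = [one_replicate(60, 0.5, mu_E=3.0, mu_C=0.0, sigma=2.5, delta=2.3,
                         eta=0.95, zeta=0.80, s02=100.0, rng=rng)
           for _ in range(500)]
# With a large true effect, most replicates should declare superiority
```

Repeating this over many replicates, and over both analysis models, produces the operating characteristics summarised in the next section.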

Results
We summarise the frequency of simulated trials concluding that E is either efficacious or futile, based on the 100,000 replicates per scenario and model. Figure 2 depicts the percentages of (sub)trials declaring effectiveness of E and those declaring futility. Wherever the lengths of the bars sum to 100%, the study has been planned with an adequate sample size for decisive decision making. As we can observe, collecting data from n 1 , . . . , n 7 patients to fit the proposed analysis model ensures that 100% of the (sub)trials conclude that E is either superior to or not better than C by δ. This is not the case (i.e., all below 100%) when implementing the Bayesian model for no borrowing, since larger sample sizes (i.e., n 0 k in Table 1) would be required to ensure the same level of posterior informativeness for the trial decision. In Scenarios 1 and 2, where n k and n 0 k were comparable, the two Bayesian models yielded comparable proportions of (sub)trials with a decisive trial decision. Yet in Scenarios 4 and 6, where substantial sample size savings were made, a disparity is observed, because the posterior distributions for θ k based on x k alone are far less informative than those based on x 1 , . . . , x K .
[ Figure 2 about here.] In Scenarios 2 and 3, E is potentially superior to C, yet the magnitude tends to be smaller than desired on average. Only subtrial 1 has a mean treatment effect greater than δ, so about 91.2% of the simulated (sub)trials declared E to be effective. By contrast, subtrials 2 and 5 have mean treatment effects closest to 0 and δ, respectively. Therefore, subtrial 2 is more likely to declare futility than effectiveness of E, whilst the opposite holds for subtrial 5. Scenario 4 mimics the borderline case where the mean treatment effect is exactly of size δ. Using the proposed methodology, about 82.1% of the simulated (sub)trials favoured E for effectiveness in all seven subtrials. These subtrialwise true positive rates are approximately equal to our chosen threshold ζ = 0.80. Scenario 5 assumes a mixture of subtrial-specific treatment effects with θ k = 0 or ≥ δ. Referring to subtrials 2, 4, 5 and 7, fewer than 5% of the simulated trials conclude effectiveness erroneously. The two Bayesian models yield similar operating characteristics in this scenario, as the computed n k and n 0 k were close. In Scenario 6, the proportion of incorrect declarations of effectiveness is maintained below 5% for all subtrials using the proposed methodology. Unsurprisingly, using the approach of no borrowing to analyse the basket trial of only 62.3 patients gives a much lower chance of obtaining a definitive conclusion. The overall false positive rate (i.e., the probability of incorrectly declaring effectiveness in at least one subtrial with true θ k = 0), based on the proposed methodology, is 0.150 for Scenario 5 and 0.054 for Scenario 6. These increase to 0.192 and 0.346, respectively, if the approach of no borrowing is implemented instead. This is not surprising, because the sample sizes were computed to control the error rate at the subtrial level. For strong control of the overall error rate, a multiplicity adjustment such as the Bonferroni procedure is required.
Even after such a correction, one can still expect more benefit from borrowing information than from not borrowing.
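As a rough check of the reported inflation, four independent subtrialwise tests at a 5% error rate give an overall rate of 1 − 0.95^4 ≈ 0.185, close to the 0.192 observed under no borrowing in Scenario 5; Bonferroni testing at α/m restores strong control. (Treating the subtrials as independent is our simplifying assumption here.)

```python
m, alpha = 4, 0.05  # four true-null subtrials, subtrialwise error rate

# Overall false positive rate if the four tests were independent
fwer = 1 - (1 - alpha) ** m            # close to the reported 0.192

# Bonferroni: test each subtrial at alpha / m for strong FWER control
alpha_bonf = alpha / m                 # 0.0125 per subtrial
fwer_bonf = 1 - (1 - alpha_bonf) ** m  # bounded above by alpha
```

The residual gap between 0.185 and 0.192 is plausibly attributable to simulation noise and mild dependence across subtrials.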
Focusing on Scenarios 4 - 6 for the true positive and false positive rates at the subtrial level, the proportions are not exactly 80% or 5% because of simulation randomness. Additional simulations have been carried out for homoscedastic scenarios by varying the value of σ 2 k . Figure 3 shows (i) the subtrialwise sample sizes, n k , determined based on our sample size formula in (11), and, correspondingly, (ii) the subtrialwise true positive and false positive rates based on the simulated 100,000 replicates of basket trials. Each set of the additional simulations yielded seven points (for K = 7 subtrials), which congregate at levels around ζ = 0.80 or 1 − η = 0.05. In summary, the proposed methodology can control the error rates at their desired levels.
[ Figure 3 about here.] We have also performed a sensitivity analysis to understand the effect of misspecified values of w qk . Table S2 in the Supplementary Materials reveals that the proposed methodology is reasonably robust to the misspecification of w qk . Nonetheless, care is needed when the value of w qk in the analysis deviates too far from that used in the design. When w qk is set to a larger value in the analysis than in the design (i.e., less borrowing is implemented than planned), a smaller percentage of trials conclude with a decisive decision. Conversely, a smaller value of w qk would yield a more informative posterior distribution, but this sometimes produces an ambiguous conclusion of effectiveness or futility.

Discussion
The importance of choosing an appropriate sample size can never be overemphasised (Senn, 2007). Whilst basket trials have major infrastructural and logistical advantages, sophisticated statistical models are needed at the sample size planning stage to preserve the added efficiency. The most widely used approach to date is based on a Bayesian stand-alone analysis model, which does not support information sharing across subtrials with commensurate treatment effects. Consequently, the majority of basket trials recruit more patients than required. This not only wastes resources, but can sometimes be unethical for exposing more patients than necessary to a treatment that is yet to be fully approved (Altman, 1980). To realise the promise of basket trials, this paper establishes a closed-form solution for the simultaneous determination of subtrial sample sizes. The simulation study shows that the proposed methodology requires a smaller trial sample size whenever 0 ≤ w qk < 1, without undermining the chance of detecting a clinically relevant difference between the experimental treatment and the control, where one exists.
For deriving our sample size formulae, we adopted the Bayesian decision-making scheme elaborated by Whitehead et al. (2008). Specifically, it involves two probability thresholds, η and ζ, for reaching a decisive statement on the treatment's effectiveness or futility. In our numerical illustration, we set η = 0.95 and ζ = 0.80 because these probability thresholds yield values of n_k^0, obtained under the approach of no borrowing, comparable to the frequentist sample size solution with α = 0.05 and β = 0.20. Other choices may certainly be feasible: there is no conventional level at which to set these probability thresholds. In practice, these quantities might be difficult to justify: fixing η at, say, 0.95 rather than 0.90 might mean a considerable increase in sample size. Since the sample sizes also depend on other parameters, we recommend that users generate plots for their own cases following the pattern of our Figure S2 in the Supplementary Materials.
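This correspondence can be checked numerically: the no-borrowing sample size should be close to the standard frequentist formula for a two-arm comparison with known variance and a one-sided test. A minimal sketch, in Python for illustration (the paper's own software is in R), using the SUMMIT-based values σ_1 = 0.587, σ_2 = 0.345 and δ = 0.4 from the Supplementary Materials; the function name is ours.

```python
from statistics import NormalDist

def no_borrowing_n(sigma, delta, alpha=0.05, beta=0.20):
    """Total subtrial sample size (both arms, 1:1 allocation) from the
    standard frequentist formula with known variance sigma^2 and a
    one-sided test at level alpha with power 1 - beta."""
    z = NormalDist().inv_cdf
    n_per_arm = 2 * sigma ** 2 * (z(1 - alpha) + z(1 - beta)) ** 2 / delta ** 2
    return 2 * n_per_arm

# SUMMIT-based values from Section A of the Supplementary Materials:
n1 = no_borrowing_n(0.587, 0.4)   # close to n_1^0 = 53.2
n2 = no_borrowing_n(0.345, 0.4)   # close to n_2^0 = 18.4
```

The agreement with n_1^0 = 53.2 and n_2^0 = 18.4 reported in Section A (up to rounding) is what motivates the choice η = 0.95 and ζ = 0.80 above.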
Two sets of key parameters for implementing the proposed methodology are the variances σ_k^2 and the levels of pairwise discounting w_qk. As in the widely used frequentist formulae (Chow et al., 2007), an increase in σ_k^2 means that larger sample sizes are needed to maintain the same level of precision. We have restricted our focus to known variances throughout, since this is common practice in most clinical trials. Appropriate values of σ_k^2 for computing the subtrial sample sizes can be informed by pilot data or information from relevant investigations. If σ_k^2 were retained as unknown parameters, priors on their magnitude would be required. Subtrial sample sizes would then be sought by controlling the average property of the posterior interval probabilities of θ_k with respect to 0 and δ, since these nuisance parameters need to be integrated out of the posterior. Sample size formulae with unknown σ_k^2 have been derived in a relevant context. Although different decision criteria were considered there, one could follow that methodology to obtain the marginal posterior distributions, into which external information on σ_k^2 may further be incorporated. For wider application, we have extended the proposed methodology to basket trials using a binary (in both the randomised controlled and single-arm settings) or time-to-event outcome; see Sections C–E of the Supplementary Materials for the corresponding sample size formulae. At the same time, we note there are limitations; for example, the censoring assumptions for time-to-event data are greatly simplified. We hope this work will stimulate further research within this Bayesian decision framework.
In the present work, data are assumed to be analysed after the completion of all subtrials. In practice, however, certain subtrials may take much longer to complete recruitment due to low prevalence. One could (a) adopt a 'first (subtrials) complete, first analysed' principle, or (b) alter the constraint for simultaneously solving n_1, . . . , n_K, e.g., by making them proportional to the prevalence of the respective target subpopulations, whilst maintaining an overall statistical power or decision accuracy. Under strategy (a), more borrowing would be possible from subtrials that complete faster towards those that complete more slowly. Under strategy (b), all subtrials may finish at about the same time, permitting a joint data analysis. We note it is not obvious whether either strategy leads to a substantial increase in the total sample size.
When borrowing of information is permitted, a reduced sample size can be expected by setting w_qk < 1: the smaller the values of w_qk, the more borrowing is possible. The present methodology requires that these values be specified to reflect the pairwise (in)commensurability of subtrial data. This is especially feasible when a pilot study has been conducted. More details about the practical implementation, particularly the specification of parameters, are available in Section F of the Supplementary Materials. Throughout, we have elaborated the methodology with a single pre-specified effect size, δ, for finding subtrial sample sizes. Extending the calculation to subtrial-specific effect sizes, say δ_k, is straightforward. A smaller value of δ_k would mean that a larger n_k is needed, all other parameters held fixed. For practical implementation, the user may substitute the corresponding argument (currently a single value) with a vector in the openly available software.
Our sensitivity analysis in Section G of the Supplementary Materials suggests the proposed methodology is reasonably robust against misspecification of w_qk. Nevertheless, when the specified values deviate too far from the truth, the resulting sample sizes would not reflect what is needed to achieve the trial's objectives. One avenue for future research would therefore be developing methodology for sample size reassessment in basket trials. Practitioners may start with rather conservative choices of w_qk assuming limited borrowing, and re-estimate w_qk at interim analyses using accumulating data from the ongoing trial. As the reassessment depends on observed early-stage data, there is a risk that the error rates across stages cannot be maintained at the intended levels; subsequent work will investigate how to avoid this inflation. The proposed methodology may also be extended to enable mid-course adaptations. For instance, a basket trial could begin with a few subsets of interest, and then restrict enrolment to those wherein patients benefit satisfactorily from the treatment based on an interim analysis. Boundaries for early stopping of certain subsets must be carefully defined to protect the overall error rates. With a reduction in the number of subsets, the synthesis weights p_qk should be updated to satisfy the constraint ∑_q p_qk = 1 for the late stage(s).
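One simple way to satisfy the constraint ∑_q p_qk = 1 after dropping subsets is to rescale the retained weights. The sketch below is our illustrative assumption of such an update (the paper may prescribe a different rule); the weights and function name are hypothetical.

```python
def renormalise_weights(p, keep):
    """Rescale the synthesis weights of the retained subtrials so that
    they again sum to one after some subsets are dropped.
    p: dict mapping source subtrial q to its weight p_qk (fixed target k).
    keep: labels of the subtrials retained at the late stage(s)."""
    kept = {q: p[q] for q in keep}
    total = sum(kept.values())
    return {q: v / total for q, v in kept.items()}

# hypothetical weights for target subtrial k before subtrial 2 is dropped
p = {1: 0.5, 2: 0.3, 3: 0.2}
p_new = renormalise_weights(p, keep=[1, 3])   # {1: 0.5/0.7, 3: 0.2/0.7}
```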

Software
All statistical computing and analyses were performed using the software environment R version 4.0.3. Programming code for implementing the sample size formulae and reproducing the numerical results is available at https://github.com/haiyanzheng/BasketTrialsSSD.

Figure 2: Percentage of (sub)trials that conclude E is efficacious (the left half of each plot) or not better than C by δ = −0.4, i.e., observing a shrinkage of 40% in the tumour volume (the right half of each plot). True subtrial-specific treatment effects, θ_k = μ_Ek − μ_Ck, are indicated on a second y-axis.

Supplementary Materials for:
Bayesian sample size determination in basket trials borrowing information between subsets by Haiyan Zheng, Michael Grayling, Pavel Mozgunov, Thomas Jaki, James Wason

A. IMPACT OF SEVERAL KEY PARAMETERS ON THE SAMPLE SIZES
In this section, we illustrate how the sample sizes change with certain key parameters, such as the variances and w_qk, for a special case where the basket trial has two subgroups only (i.e., K = 2). For illustration, we set s_0k^2 = 100, a_1 = b_1 = 2, a_2 = 54, b_2 = 3 and c_0 = 0.05. Assuming σ_1^2 = 0.587^2 and σ_2^2 = 0.345^2 (values extracted from the SUMMIT trial, which appears in Section 3.2 of the main paper) and δ = 0.4, sample sizes are sought for the inferences about θ_k with η = 0.95 and ζ = 0.80. As expected, the variance σ_k^2 is a key factor in the determination of the corresponding n_k: the sample size of subtrial 1 is generally greater than that of subtrial 2 when holding other parameters fixed. Figure S1 confirms that the sample size of subtrial 1 (subtrial 2) additionally depends on w_21 (w_12), which controls the degree of borrowing from subtrial 2 (subtrial 1): the n_k in the same row within the subtrial 1 plot, and those in the same column within the subtrial 2 plot, remain equal to the nearest integer in most cases. The only exceptions are the row with w_21 = 0 in the subtrial 1 plot and the column with w_12 = 0 in the subtrial 2 plot, where such variation is often trivial; this is caused by the use of Newton's method for solving the nonlinear equations. Figure S1 also indicates that an increase or decrease in w_qk does not lead to a linear change in the sample size of subtrial k, with [0, 0.3] being a more sensitive range than (0.3, 1]. A substantial saving in the total sample size can be achieved by setting w_21 and w_12 to small values, for example both to 0. Focusing on the third plane, the total sample size grows from the bottom left towards the top right.

Figure S1: Bayesian sample sizes of the respective subtrials and of the entire basket trial with K = 2, given various levels of incommensurability.
Furthermore, n_1 and n_2 computed here based on the proposed sample size formula are bounded above by 52.6 (resulting from w_21 = 1) and 18.2 (resulting from w_12 = 1), respectively. These are close to n_1^0 = 53.2 and n_2^0 = 18.4, as obtained under the approach of no borrowing. We now explore the behaviour of our Bayesian sample size formulae (when setting 0 ≤ w_qk ≤ 1) in comparison with the approach of no borrowing. Focusing on homoscedastic cases (i.e., σ_k^2 identical across subtrials), Figure S2 shows that the trial sample size increases as the probability thresholds, η or ζ, increase. This is unsurprising, since a larger probability threshold requires a more informative posterior distribution for θ_k to reach a decisive conclusion.
We also observe that a sample size reduction is possible by setting a small value for w_qk. More specifically, ∑_k n_k drops quickly when reducing w_qk from 0.1 to 0 in both panels (i) and (ii). Looking across the plots of the same panel, the larger (smaller) the variances, the greater (smaller) the required sample size.
We then visualise the required sample sizes for different standardised effect sizes, δ/σ_k^2, in Figure S3. We set σ_k^2 = 0.25 throughout this illustration and δ = 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30. Unsurprisingly, larger sample sizes are required to detect a smaller standardised effect size. Figure S3 also suggests that a value of w_qk close to 0 results in a substantial saving of sample size.

B. SIMULATION SCENARIOS AND ADDITIONAL RESULTS
Table S1 lists the six scenarios that have been used in our simulation study of the main paper. Additional simulation results based on a sensitivity analysis will also be presented in this section.
Additional simulations have been performed to illustrate that the proposed methodology controls the error rates at the desired levels for each subtrial.

Table S1: Simulation scenarios depicted as outcome distributions for the experimental treatment, N(μ_Ek, σ_k^2), k = 1, . . . , 7. The corresponding outcome distribution for the control is N(0, σ_k^2).

Figure S4 visualises the accuracy of decisions under two new mixed null scenarios, i.e., Scenario A with θ_1 = θ_3 = θ_6 = −0.4 and Scenario B with θ_1 = θ_3 = θ_6 = −0.5, along with θ_2 = θ_4 = θ_5 = θ_7 = 0 in both. The variances are set the same as those displayed in Scenario 5 above. As the figure shows, assuming the desired effect size δ = −0.4, the percentage of simulations incorrectly declaring E as efficacious for subtrials with θ_k = 0 is maintained below 5% using the proposed methodology.

C. EXTENDED APPLICATION TO USING A BINARY OUTCOME
When a binary outcome is used in randomised basket trials, the primary interest is in comparing the response rates on the respective treatments, denoted by ρ_Ek and ρ_Ck, in each subtrial k = 1, . . . , K.
In this section, we adapt the proposed methodology to enable borrowing of information on the scale of log-odds ratios, which are approximately normally distributed [1]. Let n_k be the subtrial sample size, for k = 1, . . . , K. The log-odds ratio estimate based on the individual subtrial data is

θ̂_k = log( ρ̂_Ek / (1 − ρ̂_Ek) ) − log( ρ̂_Ck / (1 − ρ̂_Ck) ),

where ρ̂_Ek and ρ̂_Ck are the observed response rates. Following the methodology proposed in the main paper, we represent the complementary subtrial data in commensurate priors and place a two-component Gamma mixture prior on each commensurate parameter. Applying the same decision criterion with a clinically meaningful effect size, denoted by δ, the subtrial sample sizes can be found so that the K nonlinear equations hold simultaneously.

Figure S4: Percentage of (sub)trials that conclude E is efficacious (the left half of each plot) or not better than C by δ = −0.4 under two new scenarios.
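For concreteness, the log-odds ratio and its asymptotic (Woolf-type) standard error can be computed from the observed counts as below; this is a standard construction, sketched in Python with hypothetical counts.

```python
import math

def log_odds_ratio(y_e, n_e, y_c, n_c):
    """Empirical log-odds ratio and its asymptotic standard error
    (Woolf's formula) from the observed counts in one subtrial."""
    lor = math.log(y_e / (n_e - y_e)) - math.log(y_c / (n_c - y_c))
    se = math.sqrt(1 / y_e + 1 / (n_e - y_e) + 1 / y_c + 1 / (n_c - y_c))
    return lor, se

# hypothetical counts: 30/50 responders on E, 20/50 on C
lor, se = log_odds_ratio(30, 50, 20, 50)
```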

D. EXTENDED APPLICATION TO USING A TIME-TO-EVENT OUTCOME
We now extend the proposed sample size formulae to the design of randomised basket trials with a time-to-event outcome. For simplicity, we follow George and Desu [2] in assuming that the event time, denoted by T_ijk, has an exponential distribution: T_ijk ~ Exp(π_jk), i = 1, . . . , n_k; j = E, C; k = 1, . . . , K, where the rate parameter π_jk > 0. Denote the average event time on treatment group j by T̄_jk and the number of events by D_jk in subtrial k = 1, . . . , K. By the central limit theorem, T̄_jk is asymptotically normal, and using the delta method we obtain

log(T̄_jk) ~ N(−log(π_jk), 1/D_jk).

Likewise, we let θ_k = log(π_Ck / π_Ek) and follow the proposed Bayesian methodology to enable borrowing of information between subtrials. Applying the same decision criterion as in the main paper, the subtrial sample sizes can be found accordingly, where D_k = D_Ek + D_Ck denotes the total number of events required in subtrial k, and R_k is the randomisation ratio to the experimental treatment E.
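The delta-method approximation log(T̄_jk) ~ N(−log(π_jk), 1/D_jk) can be checked by simulation. The sketch below (Python, with an arbitrary rate and event count of our choosing) standardises log(T̄) and confirms it behaves approximately like a standard normal draw.

```python
import math
import random

random.seed(1)
rate, D = 0.5, 400      # hypothetical event rate pi_jk and event count D_jk
reps = 2000

zs = []
for _ in range(reps):
    # average event time over D exponential event times
    tbar = sum(random.expovariate(rate) for _ in range(D)) / D
    # delta method: log(tbar) ~ N(-log(rate), 1/D); standardise
    zs.append((math.log(tbar) + math.log(rate)) * math.sqrt(D))

mean_z = sum(zs) / reps
var_z = sum((z - mean_z) ** 2 for z in zs) / reps
# mean_z should be near 0 and var_z near 1 if the approximation holds
```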

E. EXTENDED APPLICATION TO SINGLE-ARM SETTINGS WITH A BINARY OUTCOME
As one reviewer noted, basket trials are frequently conducted in a phase II oncology setting. Suppose that patients with a common feature (e.g., a genomic aberration or clinical symptom) are enrolled into a basket trial with K subsets, each representing a subtype of the disease. Within each trial subset ('subtrial' for short) k = 1, . . . , K, patients are treated with a new treatment (labelled E) and observed to be either a 'responder' or 'non-responder' according to predefined criteria relating to tumour shrinkage. Let Y_k denote the number of responders and n_k the number of patients in subtrial k. Thus, E(Y_k) = n_k ρ_k and Var(Y_k) = n_k ρ_k(1 − ρ_k), where ρ_k is the subtrial-specific response rate. By the delta method, the estimator of the log-odds, log(Y_k / (n_k − Y_k)), has an asymptotic normal distribution with mean θ_k = log(ρ_k / (1 − ρ_k)) and variance 1/(n_k ρ_k(1 − ρ_k)). We follow the proposed methodology to specify commensurate priors for θ_k, based on the complementary subtrial data, and place a two-component Gamma mixture prior on each commensurate parameter. Applying the same decision criterion, the subtrial sample sizes can be found so as to satisfy the K nonlinear equations simultaneously.

Figure S5 guides the user through the specification of parameters (specific to the design and data type) for implementing the proposed Bayesian sample size formulae for basket trials wherein borrowing of information is enabled with w_qk < 1. As noted, setting η = 0.95 and ζ = 0.80 creates a resemblance to the frequentist formulation of the problem in situations of no borrowing (with all w_qk = 1). The user may modify these probability thresholds for their own case; raising η or ζ increases the sample size. We recommend visualising such changes following the pattern of Figure S2, which facilitates effective communication with stakeholders when seeking sample sizes that are both affordable and sensible for estimating the effect size.
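Returning to the single-arm construction above, the empirical log-odds and its delta-method standard error can be sketched as follows (Python; the counts are hypothetical):

```python
import math

def single_arm_log_odds(y, n):
    """Empirical log-odds and its delta-method standard error
    1 / sqrt(n * rho_hat * (1 - rho_hat)) for one single-arm subtrial."""
    theta_hat = math.log(y / (n - y))
    rho_hat = y / n
    se = math.sqrt(1 / (n * rho_hat * (1 - rho_hat)))
    return theta_hat, se

# hypothetical subtrial: 15 responders among 40 patients
theta_hat, se = single_arm_log_odds(15, 40)
```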
In the randomised, two-arm settings, we have considered equal randomisation throughout by setting R_1 = · · · = R_K = 0.5. The user may change the allocation ratio for their context.

F. PRACTICAL IMPLEMENTATION OF THE PROPOSED SAMPLE SIZE FORMULAE
Having chosen a design and an outcome to evaluate the clinical benefit, the user should seek information regarding (i) the plausible response rates (ρ_k, or ρ_jk for j = E, C, if using a binary outcome) or (ii) the variability of responses (σ_k^2, if using a continuous outcome), both specific to subtrial k = 1, . . . , K, along with a clinically meaningful effect size (denoted by δ′, δ*, δ or δ† in Figure S5). This would be substantially easier if pilot data or historical information is available.
Specification of w_qk to correctly reflect the pairwise incommensurability might be challenging in practice. It is best informed by pilot data or existing information from relevant investigations. Following the derivation of the sample size formulae in Section 2 of the main paper, values for w_qk can be chosen independently of the variances, σ_k^2, k = 1, . . . , K. For continuous data, however, values of σ_k^2 and w_qk can both be based on existing information from, e.g., a pilot study or preceding trial. We have exemplified this in Section 3.2 of the main paper, where the Hellinger distance is used to measure the discrepancy between the outcome distributions. The latter can be assumed with available data from a preceding trial (i.e., the SUMMIT trial in Section 3.2). For basket trials using a binary or time-to-event outcome, one may likewise compute the Hellinger distance between the asymptotic normal distributions of the subtrial-specific log-odds, log-odds ratios or log-hazard ratios.

Figure S5: A roadmap for parameter specification in the proposed sample size formulae to design basket trials that may adopt a (i) non-randomised, single-arm, or (ii) randomised, two-arm setting in the respective subsets. Specific to the data type, the effect size (denoted by δ′, δ*, δ or δ†) is on the scale of log-odds, log-odds ratio, mean difference or log-hazard ratio, respectively. (One roadmap branch, for binary data, lists the response rates ρ_1, . . . , ρ_K and the effect size δ′; another, (b), gives the synthesis weights transformed from the columnwise non-zero w_qk, for which we recommend setting c_0 to a value close to 0+.)
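The Hellinger distance between two normal distributions has a closed form, which makes this specification straightforward to compute. A sketch in Python, using SUMMIT-style values from Section A as hypothetical inputs:

```python
import math

def hellinger_normal(mu1, s1, mu2, s2):
    """Closed-form Hellinger distance between N(mu1, s1^2) and N(mu2, s2^2)."""
    v1, v2 = s1 ** 2, s2 ** 2
    bc = (math.sqrt(2 * s1 * s2 / (v1 + v2))
          * math.exp(-((mu1 - mu2) ** 2) / (4 * (v1 + v2))))
    # guard against tiny negative values from floating-point rounding
    return math.sqrt(max(0.0, 1.0 - bc))

# identical outcome distributions lie at distance 0
d0 = hellinger_normal(0.4, 0.587, 0.4, 0.587)
# hypothetical pair with different means and standard deviations
d1 = hellinger_normal(0.4, 0.587, 0.0, 0.345)
```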
In these circumstances, the assumed values for the response rates or variances affect the levels of w_qk, and their impact on the corresponding subtrial sample sizes may thus be of interest. We refer the reader to Table 1 of Section 4, which expands the illustration of sample size determination across various scenarios. Specifically, the variances in subsets 2–7 are increased from Scenario 1 (used in Section 3.2) to Scenario 3, which gives a different set of values for w_qk; see Figure 2 of the main paper for the visualisation. The resulting sample size is then an outcome of both the increased variances and the reduced levels of pairwise incommensurability (i.e., an increased amount of borrowing): the sample sizes n_2, . . . , n_7 based on the proposed approach increase from Scenario 1 to Scenario 3, but the magnitude of the increase is not as large as that of n_2^0, . . . , n_7^0 based on the approach of no borrowing; see Table 1 of the main paper.
For the hyperparameters, the only constraint is a_1, a_2 > 1. As shown with illustrative examples in the main paper, we recommend setting these hyperparameters so that the first component has a small prior mean (e.g., a_1/b_1 ≤ 1) and the second a large prior mean (e.g., a_2/b_2 ≥ 10).
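This recommendation can be checked directly for the hyperparameter values used in Section A of this supplement (a sketch in Python; the values are the supplement's, the check is ours):

```python
# Hyperparameters of the two-component Gamma mixture prior,
# taken from Section A of this supplement.
a1, b1 = 2, 2     # first component
a2, b2 = 54, 3    # second component

assert a1 > 1 and a2 > 1            # the only hard constraint
mean1 = a1 / b1                     # prior mean of the first component
mean2 = a2 / b2                     # prior mean of the second component
assert mean1 <= 1 and mean2 >= 10   # the recommended magnitudes
```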

G. CONSEQUENCE OF PARAMETER MISSPECIFICATION FOR w qk
Following the specification of w_qk outlined in Section 3.2 of the main paper, we compute the Hellinger distance between any pair of the N(μ_Ek, σ_k^2). This yields a 7 × 7 matrix of pairwise values (w_qk), q, k = 1, . . . , 7, for Scenario 1, and all w_qk = 0 for Scenario 6. When the same matrix of w_qk is used to analyse the trial, the operating characteristics of the proposed design are as visualised in plots (a) and (f) of Figure 3 of the main paper, respectively. To examine the impact of misspecification of w_qk, we consider three 7 × 7 matrices, wherein w_qk = 0.1, 0.3, 0.5 (for q ≠ k) along with w_kk = 0, k = 1, . . . , 7, applied in the Bayesian analysis. In the following sensitivity analysis, we simulate 100,000 replicates of the basket trial with the same parameter configuration as in the main simulation study. Table S2 shows that when the true values of w_qk are used in the analysis, all of the simulated trials reach a decisive subtrial-wise conclusion (100% = Effectiveness% + Futility%). When w_qk is specified as a greater value in the analysis than that used in the design (meaning that the amount of borrowing is attenuated), a smaller proportion of trials reach a decisive decision. This is evident from the results for Scenario 6 in Table S2 in the rows with w_qk = 0.1, 0.3, 0.5 (all greater than 0, the value of w_qk used in the design). The further the values deviate from the designated level, the fewer trials conclude correctly on the futility of the treatment in Scenario 6. By contrast, when w_qk is specified as a smaller value, some trials may produce an ambiguous decision on effectiveness or futility; see, for example, k = 1 in Scenario 1 with all w_qk = 0.1, which is consistently lower than the values used for the design.
When the magnitude of the deviation is small, the impact of misspecification is trivial: the percentages for k = 7 in Scenario 1 with w_qk = 0.3 are comparable to those with the true w_qk used in the analysis.
Finally, we re-emphasise that a change in w_qk does not lead to a linear increase or decrease in the corresponding sample size. As Figure 1 in the main paper illustrates, the region bounded by 0 and 0.3 tends to be more sensitive than the other half, particularly compared with the range from 0.5 to 1. Attention is therefore needed regarding the misspecification of w_qk if the true values lie within the sensitive region, or if the deviation crosses the regions (e.g., setting w_qk = 0.55 in the design yet w_qk = 0.25 in the analysis, or the other way around).