Designing three-level cluster randomized trials to assess treatment effect heterogeneity

SUMMARY Cluster randomized trials often exhibit a three-level structure with participants nested in subclusters such as health care providers, and subclusters nested in clusters such as clinics. While the average treatment effect has been the primary focus in planning three-level randomized trials, interest is growing in understanding whether the treatment effect varies among prespecified patient subpopulations, such as those defined by demographics or baseline clinical characteristics. In this article, we derive novel analytical design formulas based on the asymptotic covariance matrix for powering confirmatory analyses of treatment effect heterogeneity in three-level trials, that are broadly applicable to the evaluation of cluster-level, subcluster-level, and participant-level effect modifiers and to designs where randomization can be carried out at any level. We characterize a nested exchangeable correlation structure for both the effect modifier and the outcome conditional on the effect modifier, and generate new insights from a study design perspective for conducting analyses of treatment effect heterogeneity based on a linear mixed analysis of covariance model. A simulation study is conducted to validate our new methods and two real-world trial examples are used for illustrations.


INTRODUCTION
Pragmatic cluster randomized trials (CRTs) are increasingly popular for comparative effectiveness research within health care delivery systems based on real-world interventions and broadly representative patient populations.While the average treatment effect (ATE) has typically been the primary focus in planning pragmatic trials, interest is growing in understanding whether the treatment effect varies among prespecified patient subpopulations, such as those defined by demographics or baseline clinical characteristics.By design, the flexible inclusion of a range of clusters and patients to mimic real-world practice in pragmatic trials naturally induces more heterogeneity, an aspect that should be reflected at the design stage and which invites the study of the associated factors that may lead to variation in treatment effects.
While data-driven exploratory heterogeneity of treatment effect (HTE) analysis can be ad hoc and used to generate hypotheses for future studies, confirmatory HTE analysis is often hypothesis driven, and carried out based on prespecified effect modifiers identified from either prior data or subject-matter knowledge (Wang and Ware, 2013).A recent systematic review reported that 16 out of 64 health-related CRTs published between 2010 and 2016 examined HTE among prespecified demographic patient subgroups but also noted a lack of methods for designing CRTs to enable the assessment of confirmatory HTE (Starks and others, 2019).To this end, Yang and others (2020) proposed an analytical power calculation procedure for HTE analysis in CRTs with two-level data where patients are nested in clusters.Although planning CRTs for assessing the ATE only requires accounting for the outcome intraclass correlation coefficient (ICC), planning CRTs for assessing the HTE additionally requires accounting for the covariate-ICC (or equivalently, ICC of the effect modifier).The covariate-ICC captures the fraction of between-cluster covariate variation relative to the total of between-and within-cluster covariate variation and plays an important role in sample size determination for HTE analysis.
Although these two distinct types of ICCs were identified as essential ingredients for estimating the power of the HTE test in two-level CRTs, analytical design formulas are currently unavailable for pragmatic trials exhibiting a three-level structure.As a first example, the Health and Literacy Intervention (HALI) trial in Kenya is a three-level CRT studying a literacy intervention to improve early literacy outcomes (Jukes and others, 2017).In the HALI trial, continuous literacy outcomes are collected for children (participants), who are nested within schools (subclusters), which are further nested in Teacher Advisory Center (TAC) tutor zone (cluster); randomization is carried out at the TAC tutor zone level.Beyond studying the ATE on the literacy outcomes, one may be interested in detecting effect modification by different levels of the baseline literacy measurements to see whether the intervention is more effective in improving spelling or reading test scores among those with lower baseline scores.Power analysis for such an interaction test requires accounting for the covariate-ICCs and outcome-ICCs within each school and across different schools.In a second example, the Strategies to Reduce Injuries and Develop Confidence in Elders (STRIDE) trial (Gill and others, 2021) aims to study the effect of a multifactorial fall prevention intervention program and recruited patients (participants) nested within clinics (subclusters), which were nested in health care systems (clusters); the randomization was at the clinic level and therefore the study can be considered as a subcluster randomized trial.The study included prespecified HTE analysis with patient-level effect modifiers such as age and gender, among others.Suppose the interest lies in detecting potential intervention effect modification by gender on the continuous outcome, concern score for falling, the power analysis for such an interaction test also required accounting for the covariate-ICCs and outcome-ICCs within each clinic and between different clinics.In both examples, as we discuss in Section 2.1, a direct application of the design formula developed in two-level CRTs can result in either conservative or anticonservative sample size estimates; therefore, the three-level hierarchical structure clearly requires new methods that enable a precise understanding of how both participant-and subcluster-level characteristics moderate the impact of new care innovations.Addressing this gap in three-level designs is further complicated by the fact that not only can the randomization be carried out at either the subcluster or cluster level but also the effect modifier can be measured at each level.To date, existing methods for power analysis in three-level trials were mainly developed for studying the ATE, with a closed-form design effect jointly determined by the between-subcluster outcome-ICC and the within-subcluster outcome-ICC (Heo and Leon, 2008;Teerenstra and others, 2010).Dong and others (2018) have developed sample size formulas for HTE analysis with a univariate effect modifier under cluster-level randomization.However, the essential role of covariate-ICCs was neglected in their design formulas, which lead to under-powered trials as we demonstrate in Section 3. Beyond Dong and others (2018), computationally efficient and yet accurate power analysis procedures are lacking to ensure sufficient power for HTE analyses in three-level trials.
In this article, we contribute novel design formulas to address multilevel effect modification in trials with a three-level structure, without restrictions on the level of randomization or the level at which the effect modifier is measured.We characterize a nested exchangeable correlation structure for both the effect modifier and the outcome conditional on the effect modifier and derive new sample size expressions for HTE analysis based on the asymptotic covariance matrix corresponding to a linear mixed analysis of covariance (LM-ANCOVA) model.With a small set of design assumptions, our closed-form expressions can adequately capture key aspects of the data generating processes that contribute to the variance of the interaction effect parameters are a good approximation to the true Monte Carlo variance for the sample size of interest (as we demonstrate in Section 3) and therefore obviate the need for exhaustive power calculations by simulations.Based on either a univariate effect modifier or multivariate effect modifiers, we prove in Section 2 that both the between-subcluster and within-subcluster covariate-ICCs are key elements in the asymptotic variance expression of the HTE estimator.Beyond power analysis for the HTE, in Section 2.3, we point out that the LM-ANCOVA framework can be used for more efficient analyses of the ATE, thus allowing the use of familiar sample size equations for testing the ATE.We report a simulation study to validate our new methods in Section 3 and provide two concrete examples in Section 4. Section 5 concludes.

POWER ANALYSES FOR CONFIRMATORY HTE ANALYSIS IN THREE-LEVEL TRIALS
We consider a parallel-arm design with a three-level data structure, including the common scenario where participants (level 1) are nested in subclusters (level 2), which are nested in clusters (level 3).Let Y ijk be the outcome from participant k (k = 1, . . ., m) from subcluster j (j = 1, . . ., n s ) in cluster i (i = 1, . . ., n c ).For design purposes, we assume each cluster includes an equal number of subclusters (n s ), and each subcluster includes an equal number of participants (m).We denote W ijk = 1 if participant k in subcluster j of cluster i receives treatment and W ijk = 0 if the participant receives usual care.While randomization is more frequently carried out at the cluster or subcluster level in three-level designs, our framework also allows participant-level randomization as we elaborate on below.
For prespecified HTE analysis with a set of p effect modifiers, X ijk = (X ijk1 , . . ., X ijkp ) T , we consider the following LM-ANCOVA model to formulate the HTE test: where γ i ∼ N (0, σ 2 γ ) is the random intercept at the cluster level, u ij ∼ N (0, σ 2 u ) is the random intercept at the subcluster level, and ijk ∼ N (0, σ 2 ) is the participant-level random error.We adopt the conventional assumption in CRTs that the random effects are mutually independent and assume the absence of additional random variations.In model (2.1), β 1 is the intercept parameter, β 2 is the main effect of the treatment, β 3 and β 4 are the main covariate effects and the treatment-by-covariate interactions.The null hypothesis of no systematic HTE across subpopulations defined by X ijk can be formulated as H 0 : β 4 = 0. Putting H 0 in the context of a single binary effect modifier where X ijk = 1 for female and X ijk = 0 for nonfemale, β 1 represents the mean outcome among the nonfemale subgroup without receiving treatment, β 2 represents the treatment effect for the nonfemale subgroup, β 3 represents the mean outcome difference between the gender subgroups, and the scalar parameter β 4 encodes the difference in treatment effect between the gender subgroups, and a two-sided HTE test can proceed with the Wald statistic based on a standard normal sampling distribution.Of note, when X ijk is mean centered, the interpretations of β 1 and β 2 will change (we discuss the interpretation of β 2 after mean-centering covariates in Section 2.3), but the interpretations of β 3 and β 4 remain unchanged, and therefore mean-centering covariates does not affect the HTE test for H 0 : β 4 = 0.
We assume the allocation ratio to the treatment and control groups is W /(1 − W ), with W = E(W ijk ) ∈ (0, 1).We consider three design configurations: (i) CRT with randomization at the cluster level such that n c W clusters are randomized to treatment and n c (1 − W ) clusters are randomized to usual care; (ii) CRT with randomization at the subcluster level such that within each cluster, n s W subclusters are randomized to treatment and n s (1 − W ) subclusters are randomized to usual care; (iii) individually randomized trial (IRT) with a hierarchical structure such that within each subcluster, mW participants are randomized to treatment and m(1 − W ) participants are randomized to usual care.In configurations (i) and (ii), we can further write the treatment indicator W ijk = W i , ∀ j, k and W ijk = W ij , ∀ k, respectively, but we maintain W ijk throughout for a more unified presentation.To facilitate the characterization of the covariance parameters, we reparameterize model (2.1) by centering the treatment indicator and obtain where b , and b 4 = β 4 .Let σ 2 y|x denote the total variance of the outcome conditional on X ijk .Under model (2.2), σ 2 y|x = σ 2 γ + σ 2 u + σ 2 .The strength of dependency between two outcomes conditional on covariates can be characterized by two distinct outcome-ICCs.The within-subcluster outcome-ICC describes the strength of dependency between two outcomes within the same subcluster, or The betweensubcluster outcome-ICC describes the degree of dependency between two outcomes from two different subclusters but within the same cluster, or then the implied correlation structure for Y i is nested exchangeable (Teerenstra and others, 2010) with the matrix expression given by where "⊗" refers to the Kronecker operator, I d and J d are a d × d identity matrix and d × d matrix of ones, respectively.Define the collection of design points T , and Z i = (Z i11 , . . ., Z ins,m ) T as the design matrix for cluster i.Given values of the variance components, the best linear unbiased estimator of b converges in distribution to a multivariate normal random variate with mean zero and covariance matrix Then, the asymptotic variance of b 4 (based on √ n cregime) is the lower-right element of the square matrix (∞,ns,m) .Design calculations for sample size and power then boil down to explicitly characterizing the key elements of (∞,ns,m) , which we operationalize below.

Univariate effect modifier
We first provide the analytical variance expressions when the treatment effect heterogeneity concerns a univariate effect modifier (p = 1).For testing H 0 : β 4 = 0 with a two-sided Wald test with type I error rate α and target power 1−λ, the required number of clusters, number of subclusters per cluster, and subcluster size, (or the total sample size), should satisfy n c n s m ≥ σ 2 4 (z 1−α/2 −z 1−λ ) 2 / 2 HTE , where z q is the qth quantile of the standard normal distribution, HTE is the interaction effect size and σ 2 4 refers to the lower-right element of n s m (∞,ns,m) , or σ 2 4 = lim nc→∞ n c n s mvar( β 4 ).Notice that (∞,ns,m) is a function of n s , m and decreases to zero as n s or m increases.Therefore, we intentionally define σ 2 4 by multiplying (∞,ns,m) with n s m to facilitate efficiency comparison when randomization is conducted at different levels.This way, the limit of σ 2 4 will remain as a constant instead of zero if we also let n s or m increase.To proceed, we write λ 1 = 1−α 0 , λ 2 = 1+(m−1)α 0 −mα 1 and λ 3 = 1+(m−1)α 0 +(n s −1)mα 1 as three distinct eigenvalues of the nested exchangeable correlation structure (2.3) (Li and others, 2018), based on which an explicit inverse is given by R For testing the HTE, we further assume a nested exchangeable correlation structure for the univariate effect modifier such that the marginal correlation matrix for where ρ 0 = corr(X ijk , X ijk ), ∀ k = k , defines the within-subcluster covariate-ICC and ρ 1 = corr(X ijk , X ij k ), ∀ j = j (without restrictions on k, k ), defines the between-subcluster covariate-ICC.
Figure 1 provides a graphical representation of the data structure in a three-level trial.Defining )mρ 1 as three distinct eigenvalues of the nested exchangeable correlation structure (2.4), we first establish the following result.
THEOREM 2.1 Defining σ 2 x as the marginal variance of the univariate effect modifier, the limit variance σ 2 is dependent on the level of randomization.(i) When randomization is carried out at the cluster level, we have (ii) When randomization is carried out at the subcluster level, we have (iii) When randomization is carried out at the participant level, we have 1) , with equality obtained in the absence of residual clustering (e.g., The proof of Theorem 2.1 can be found in Appendix A of the Supplementary material available at Biostatistics online.Depending on the level of randomization, Theorem 2.1 provides a cascade of variance expressions to facilitate analytical power calculations.First, in the absence of any residual clustering in a three-level design such that x }, which is essentially the limit variance of an interaction effect estimator in a simple ANCOVA model without clustering (Shieh, 2009).Therefore, we immediately notice that all three variances share the same form of the variance without clustering, σ 2 y|x /{W (1−W )σ 2 x }, multiplied by a design effect expression characterizing the nontrivial residual clustering, which we denote by θ 4,(1) / σ 2 4,(1) = λ 1 , when the randomization unit is the cluster, subcluster, and participant, respectively.When the randomization is at the cluster level, Cunningham and Johnson (2016) showed that the design effect for testing the ATE in a three-level trial is unbounded when the subcluster size m increases indefinitely.
In sharp contrast, with a participant-level effect modifier, the design effect under a three-level CRT is bounded above even if the subcluster size goes to infinity, as Intuitively, the HTE parameter β 4 corresponds to a participant-level interaction covariate that varies within each subcluster whereas the ATE corresponds to a cluster-level treatment variable that only varies between clusters; this subtle difference underlies the difference in the two design effects.Interestingly, when ρ 0 ≤ α 0 , then θ (3) (n s , ∞) ≤ 1 and there can even be variance deflation in testing HTE under a three-level CRT relative to an individually randomized trials without clustering.For the subcluster randomized trial, , and the design effect is similarly bounded in the limit.The design effect θ (1) , however, does not depend on n s or m and shares the same form with the one derived in Cunningham and Johnson (2016) for testing the ATE in a three-level IRT.
In addition, Theorem 2.1 demonstrates that randomization at a lower level leads to potential efficiency gain for estimating the interaction parameter β 4 .This ordering result parallels that developed for testing the ATE in three-level designs (Cunningham and Johnson, 2016) and is intuitive because randomization at a lower level allows the within-cluster or within-subcluster comparisons to inform the estimation of the associated parameter.All three variances are increasing functions of the conditional outcome variance σ 2 y|x and decreasing functions of the marginal covariate variance σ 2 x , matching the intuition that explained variation due to the effect modifier can lead to a HTE estimate with higher precision.Similar to the analysis of the ATE in IRTs or CRTs, a balanced design with equal randomization, W = 1/2, leads to the largest power for testing H 0 under a fixed sample size.Interestingly, the covariate-ICCs enter the variance for estimating β 4 in an orderly fashion, that is, σ 2 4,(1) is independent of ρ 0 and ρ 1 , and σ 2 4,(2) depends on ρ 0 but is The intraclass correlation parameter between outcomes from two participants within the same subcluster The intraclass correlation parameter between outcomes from two participants in two different subclusters The intraclass correlation parameter between covariates from two participants within the same subcluster The intraclass correlation parameter between covariates from two participants in two different subclusters ⇑ Indep Indep "Indep" indicates that the variance is independent of and thus does not change with the specific ICC parameter; "⇑" indicates a monotonically increasing relationship; "⇓" indicates a monotonically decreasing relationship; and " " indicates a nonmonotone and likely quadratic relationship.
free of ρ 1 , whereas σ 2 4,(3) depends on both ρ 0 and ρ 1 .This observation suggests that the variance of β 4 only depends on the covariate-ICCs defined within each randomization unit but not between randomization units.
Theorem 2.1 helps demystify the relationships between the key ICC parameters and design efficiency for estimating the HTE.When randomization is carried out at the cluster level, larger values of covariate-ICCs, ρ 0 , ρ 1 , are always associated with a larger σ 2 4,(3) (smaller power).This is expected because larger covariate-ICCs imply less per-participant information for estimating β 4 and therefore reduce statistical efficiency.The relationship between σ 2 4,(3) and outcome-ICCs, α 0 , α 1 , can be nonmonotone and is graphically explored in Figure S1 of the Supplementary material available at Biostatistics online, where σ 2 4,(3) exhibits a parabolic relationship with α 0 and α 1 .This implies that a direct application of the existing design formula in Yang and others (2020) by equating α 0 = α 1 and ρ 0 = ρ 1 can lead to either conservative or anticonservative predictions for σ 2 4,(3) and inaccurate sample size results in general.When randomization is carried out at the subcluster level, larger values of between-subcluster outcome-ICC, α 1 , is always associated with a smaller σ 2 4,(2) (larger power), whereas a larger value of covariate-ICC, ρ 0 , is always associated with a larger σ 2 4,(2) (smaller power).However, σ 2 4,(2) can be nonmonotone in the within-subcluster outcome-ICC, α 0 (see Figure S2 of the Supplementary material available at Biostatistics online for a graphical exploration).Finally, when randomization is carried out at the participant level, a larger value of within-subcluster outcome-ICC, α 0 , is always associated with a smaller σ 2 4,(1) (larger power), whereas σ 2 4,(1) is independent of other ICC parameters.Table 1 provides a concise summary of these relationships and we further expand on these remarks in Appendix B of the Supplementary material available at Biostatistics online.

Generalization to accommodate multivariate effect modifiers
Although a common practice for confirmatory tests of HTE is to consider a univariate effect modifier, the above analytical sample size procedure can be generalized to jointly test the interaction effects with multivariate effect modifiers.In this case, we write is the estimated covariance matrix of β 4 .For fixed effect size vector β 4 = HTE , the corresponding power equation of this Wald test is approximated by where χ 2 1−α (p) is the (1 − α) quantile of the central Chi-squared distribution with p degrees of freedom and f (x; p, ϑ) is the probability density function of the χ 2 (p, ϑ) distribution.Fixing any two of n c , n s , or m, solving (2.6) for the other argument (for example, using the uniroot function in R and rounding to the next integer above) readily gives the required sample size to achieve a desired level of power; therefore, sample size determination for HTE analysis with multivariate effect modifiers boils down to the characterization of 4 in explicit forms.
In Appendix C of the Supplementary material available at Biostatistics online, assuming a nested block exchangeable correlation structure for the multivariate effect modifiers, we prove a general version of Theorem 2.1.The general results establish the analytical forms of three covariance matrices of the p-vector of HTE estimators, 4,(3) , 4,(2) , and 4,(1) , when randomization is carried out at the cluster level, subcluster level, and participant level, respectively.Furthermore, these covariance matrices have Löwner ordering such that 4,(3) 4,(2) 4,(1) , with equality obtained in the absence of residual clustering (e.g., λ 1 = λ 2 = λ 3 = 1 or σ 2 γ = σ 2 u = 0).Finally, with a univariate effect modifier measured at the participant level, the LM-ANCOVA model (2.1) can also be used to detect contextual effect modification by decomposing the covariate effect into a cluster-level, subcluster-level, and participant-level components to address different effects for the aggregated and lower-level variations (Raudenbush, 1997).Correspondingly, the transformed vector of effect modifiers, and the more general Theorem derived in Appendix C of the Supplementary material available at Biostatistics online can be applied for analytical power analysis with contextual effect modification.We derive those explicit expressions in Appendix D of the Supplementary material available at Biostatistics online as an application of this more general result.

Connections to sample size requirements for studying the ATE
Even though the primary motivation for model (2.1) is to study HTE with prespecified effect modifiers, the LM-ANCOVA model also implies a covariate-adjusted estimator for the ATE.This is because the conditional ATE given by LM-ANCOVA is 4 X ijk , and therefore the marginal ATE parameter can be given by integrating over the distribution of effect modifiers, ATE = β 2 + β T 4 μ 1 .Often times, a practical strategy is to globally mean center the collection of effect modifiers such that the sample mean is zero.Without loss of generality, we assume μ 1 = 0 and therefore ATE = β 2 can be interpreted as the covariate-adjusted ATE estimator.Such an observation is akin to the ANCOVA analysis of IRTs, for which model (2.1) is a direct generalization with multiple random effects.Different from prior work on ANCOVA analysis of IRTs without residual clustering (Yang and Tsiatis, 2001), we assume model (2.1) is correctly specified such that an explicit model-based variance expression of the ATE estimator can be used as a basis for study planning.Specifically, we denote the asymptotic variance expression of β 2 as σ 2 2 = lim nc→∞ n c n s mvar( β 2 ), based on which the generic sample size requirement in a three-level design is Based on model (2.1), we provide a final result to facilitate power and sample size determination for testing the ATE in three-level designs (proof in Appendix E of the Supplementary material available at Biostatistics online).THEOREM 2.2 The limit variance expression of the covariate-adjusted ATE estimator depends on the unit of randomization and is at most a function of the outcome-ICCs.(a) When the randomization is carried out at the cluster level, we have σ 2 2,(3 The variances are linearly ordered such that σ 2 2,(3) ≥ σ 2 2,(2) ≥ σ 2 2,(1) , and equality is obtained when Theorem 2.2 implies that the design effect in a three-level design for estimating the ATE, as compared to an unclustered randomized design, is one eigenvalue of the outcome-ICC matrix R i , matching those derived by Cunningham and Johnson (2016) under three-level designs with one subtle difference.In Cunningham and Johnson (2016), the design effects were all derived assuming a linear mixed model without any effect modifiers and therefore the outcome variance and the outcome-ICCs in each eigenvalue are marginal with respect to effect modifiers.In contrast, the design effect expressions implied by Theorem 2.2 are parameterized by outcome variance and outcome-ICCs conditional on effect modifiers.As we demonstrate in Section 3, adjusting for effect modifiers can partially explain the residual variation of the outcomes at any level, and therefore may reduce either the outcome variance or the amount of residual correlation.In such a case, the conditional outcome-ICCs and conditional outcome variance are frequently no larger than their marginal counterparts, and therefore the ATE estimators based on LM-ANCOVA are likely to be more efficient than those under the unadjusted linear mixed model.This improvement in efficiency can directly translate into sample size savings.Finally, if the marginal variance of a univariate effect modifier is a unity such that σ 2 x = 1, regardless of the level of randomization, the large-sample variance expression for the covariate-adjusted ATE estimator is identical to the large-sample variance expression for the HTE estimator, when the effect modifier is measured at or above the unit of randomization.Namely, σ 2 2,(1) = σ 2 4,(1) regardless of the measurement level of the effect modifier; σ 2 2,(2) = σ 2 4,(2) when the effect modifier is measured at the subcluster or cluster level; and σ 2 2,(3) = σ 2 4,(3) when the effect modifier is measured at the cluster level.

NUMERICAL STUDIES
We carry out simulations to assess the finite-sample performance of the proposed sample size procedures for planning three-level trials.Our objectives are (i) assessing the accuracy of the proposed methods for powering three-level trials to detect HTE as well as the ATE and (ii) demonstrating, from a costeffectiveness perspective, whether the sample size estimates based on Theorem 2.2 can be smaller than those estimated by the approach given in Cunningham and Johnson (2016), for the same level of power.
Whenever applicable, we also compare our sample size results with those obtained from Dong and others (2018).To focus on the main idea, throughout we assume a univariate continuous effect modifier measured at the participant level.We consider designs with randomization at each of the three levels, leading to CRTs, subcluster randomized trials, and IRTs with a hierarchical structure.We assume equal randomization with W =1/2, and a balanced design with equal numbers of subclusters n s in each cluster and equal subcluster sizes m.
For the first objective, we fix σ 2 y|x and σ 2 x at 1, nominal type I error α = 0.05 and the desired power level 1 − λ = 0.8; the remaining parameters are varied.Under each design, we consider two levels of subcluster sizes m ∈ {20, 50} and two values for the number of subclusters per cluster n s ∈ {4, 8}.Since typically α 0 ≥ α 1 and ρ 0 ≥ ρ 1 , we chose (α 0 , α 1 ) ∈ {(0.015, 0.01), (0.1, 0.05)} to represent small and moderate conditional outcome-ICC, and (ρ 0 , ρ 1 ) ∈ {(0.15, 0.1), (0.3, 0.15), (0.5, 0.3)} to represent small, moderate, and large covariate-ICCs, based on ranges commonly reported in the CRT literature (Murray and Blitstein, 2003).To ensure that the predicted number of clusters is practical and mostly below 100, we set HTE = 0.1 for all randomization scenarios, ATE = 0.2 for the CRTs, and ATE = 0.1 for the subcluster randomized trial or IRT.For each parameter combination, we estimate the number of clusters n c that ensures at least 80% power and round to the nearest even integer above to ensure an exactly equal randomization.We then use the predicted cluster number n c to simulate correlated outcome data based on the LM-ANCOVA model and compute the empirical power of the Wald test for HTE or the ATE.When the randomization is carried out at the cluster level, we quantify the proportion of explained variation due to the effect modifier (details in Appendix F of the Supplementary material available at Biostatistics online) and use the method of Dong and others (2018) to obtain the required number of cluster n * c to ensure at least 80% power, as a comparator to our new method.
We generate the individual-level outcomes using the LM-ANCOVA model (2.1) by fixing β 1 = 1 and β 3 = 0.3.For the HTE tests under each randomized design, we fix β 2 = 0.2, and β 4 = HTE .For studying the empirical power of the Wald test for the covariate-adjusted ATE, we fix β 4 = 0.05 such that β 2 = ATE − β 4 μ x .We generate the correlated effect modifiers based on the linear mixed model, ), and c ij ∼ N (0, σ 2 x (1 − ρ 0 )).The cluster-specific random intercept in LM-ANCOVA γ i ∼ N (0, σ 2 y|x α 1 ), the subcluster-level random effect u ij ∼ N (0, σ 2 y|x (α 0 − α 1 )), and the random error ijk ∼ N (0, σ 2 y|x (1 − α 0 )).For each parameter configuration, we generate 5000 hypothetical trials to evaluate the empirical type I error under the null and empirical power under the alternative.For each hypothetical trial, we fit the LM-ANCOVA model using restricted maximum likelihood estimation and carry out the corresponding Wald test for inference.To evaluate the covariate-adjusted Wald test for the ATE, we first globally meancenter the effect modifier before fitting the LM-ANCOVA.Finally, while the sample size estimation for HTE test and ATE test under a three-level design can generally be based on the standard normal distribution, we consider the t-distribution with the between-within degrees of freedom (n c − 2) only when studying the ATE under a CRT (standard normal distribution will still be used for testing the ATE in both subcluster randomized trials and IRTs).This choice represents an effective small-sample degrees of freedom correction specifically for CRTs with a limited number of clusters (and will have negligible difference from the standard normal when n c is sufficiently large, also see Chapter 2.4.2 in Pinheiro and Bates (2006)) and has been previously shown to maintain valid type I error rate (Li and others, 2016) in small CRTs for testing the ATE and therefore is adopted to more objectively assess the agreement between empirical and predicted power under that specific scenario.For this scenario, we also confirm in Table S1 of the Supplementary material available at Biostatistics online that the predicted variance of the ATE estimator based on Theorem 2.2 is close to the Monte Carlo variance.
For the second objective, given each effect size ATE , we still generate the correlated outcome data using the LM-ANCOVA model as the above, but assume that the primary analysis of the ATE is based on a linear mixed model without the effect modifier.Correspondingly, we compute the required sample size using the formulas in the absence of any covariates.To operationalize those formulas, we obtain the total outcome variance marginalizing over the covariate distribution, σ 2 y .Furthermore, the unconditional outcome-ICCs can be approximated by α 0 ≈ ωα 0 + (1 − ω)ρ 0 and α 1 ≈ ωα 1 + (1 − ω)ρ 1 , where the explicit form of ω is derived in Appendix F of the Supplementary material available at Biostatistics online.The required sample size for the unadjusted mixed model analysis is then estimated from the Theorem 2.2 but by replacing α 0 , α 1 with α 0 , α 1 , and σ 2 y|x with σ 2 y , namely, using the formulas in Cunningham and Johnson (2016).We compared the results with those estimated based on Theorem 2.2 to investigate saving in sample size due to adjustment for the univariate effect modifier.Finally, we also obtain the empirical type I error rate and empirical power by fitting the linear mixed model omitting X ijk to verify the accuracy of the sample size procedure without the effect modifier.

Results
Table 2 provides a summary of the estimated number of clusters using the proposed formula in Theorem 2.1, the empirical size, empirical power as well as predicted power by formula of the Wald test for HTE when HTE = 0.1, and when the randomization is carried out at the cluster level (differences from 80% power are due to rounding).The Wald test maintains the nominal type I error rate, which ensures the validity of the subsequent comparisons between the empirical and predicted power.Across all scenarios, the predicted power is in good agreement with the empirical power, even when there are as few as 10 clusters.In Table 2, we also observe that the method of Dong and others (2018) often leads to much smaller sample size estimates ( n * c ) than the proposed method and therefore their method is consistently anticonservative.In Appendix F of the Supplementary material available at Biostatistics online, we have re-expressed their design formula using our notation and found that it does not depend on any covariate-ICC parameters.It is apparent from Table 2 that ignoring the covariate-ICC at each level during study planning can result in under-powered CRTs, especially in the presence of nontrivial covariate-ICCs.
Under cluster randomization, Table 3 provides a summary of the estimated number of clusters using the proposed formula in Theorem 2.2, the empirical size, empirical power as well predicted power by formula of the Wald test for the covariate-adjusted ATE when ATE = 0.2.The Wald test still maintains valid type I error rate, with empirical power close to analytical prediction by our formula.This suggests that our variance expressions are accurate for study design purposes.Interestingly, even though the ATE size is twice the HTE effect size, the required sample size to achieve 80% power is not always larger for the HTE test compared to the ATE and can depend on the remaining design parameters.Finally, we demonstrate that ignoring the univariate effect modifier in the study design stage can lead to larger than necessary sample size estimates under a wide range of design configurations (Table S2 of the Supplementary material available at Biostatistics online).In fact, for the same ATE size, we find that the required sample size based on the Cunningham and Johnson (2016) formula may even be 50% larger than that based on Theorem 2.2, as a result of explained variation.For example, as seen in Table S2 of the Supplementary material available at Biostatistics online, the adjusted outcome-ICCs α 0 and α 1 can be substantially smaller than their marginal counterparts, α 0 , α 1 , especially when the covariate-ICCs are farther away from zero.
Simulation results for when the randomization is carried at the subcluster level and at the participant level are presented in Tables S3-S8 of the Supplementary material available at Biostatistics online.The patterns are qualitatively similar to the results in Tables 2 and 3 as well as Table S2 of the Supplementary material available at Biostatistics online and confirm that our analytical power procedure can accurately track the empirical power for the Wald test for HTE and ATE.In addition, these results confirm the ordering properties of the variances under different randomized designs, with the cluster-level randomization requiring the largest sample size and the participant-level randomization requiring the smallest sample size, for studying both HTE and the ATE.Finally, we observe that, for studying the ATE, the sample Table 2.Estimated required number of clusters n c for the HTE test based on the proposed formula, empirical type I error (Emp.size), empirical power (Emp.power), as well as predicted power (Pred.power) for the HTE, test when randomization is at the cluster level.For studying power, the HTE effect size is fixed at HTE = 0.1.In the last two columns, we estimate the required sample size n * c using the method of Dong and others (2018) and obtain the actual predicted power (Actual power) using our formula based on the estimated sample size n * c to assess the degree to which the three-level CRT may be under-powered size saving by adjusting for the participant-level effect modifier is often the largest under a CRT and the smallest under an IRT with a hierarchical structure.

TWO REAL CRT EXAMPLES
We illustrate the proposed methods using two real trial examples, with a primary focus on the detection of confirmatory HTE.In both examples, we consider two-sided tests with a nominal 5% type I error and 80% power.
EXAMPLE 4.1 (The HALI trial) The HALI trial compared the effect of a literacy intervention with usual instruction in terms of students' literacy outcomes measured by performance on spelling and reading tests (Jukes and others, 2017).The study exhibits a three-level structure since the children participants are nested in schools (subclusters) which are further nested in TAC tutor zones (clusters); and the randomization of literacy intervention is carried out at the TAC level with W = 1/2.There are approximately m = 25 children in each school, and on average n s = 4 schools per TAC tutor zone.We focus on the 9-month spelling score as a continuous outcome and the baseline spelling score as a continuous potential effect modifier.Based on the estimates in Jukes and others (2017), we assume the conditional within-subcluster and between-subcluster outcome-ICCs as α 0 = 0.104 and α 1 = 0.008.We further assume the withinsubcluster and between-subcluster covariate-ICCs ρ 0 = 0.2 and ρ 1 = ρ 0 /2 = 0.1.Using Theorem 2.1(a), the required number of TAC tutor zones to detect HTE by the baseline spelling score for a standardized effect size, HTE σ x /σ y|x = 0.12 (interpreted as the impact due to one standard deviation unit change in baseline spelling score on standard deviation unit change in the 9-month spelling score) is n c = 24.To explore the sensitivity of power  insensitive to the magnitude of between-cluster covariate-ICC.While larger values of the within-cluster covariate-ICC can reduce the power of the HTE test to below 70%, variations of the outcome-ICCs within the range we considered maintain the power of the HTE test above 80%.
EXAMPLE 4.2 (The STRIDE Trial) The STRIDE trial was a subcluster randomized trial comparing the effectiveness of a multifactorial fall prevention intervention program versus enhanced usual care on patient's health outcomes (Gill and others, 2021).The study randomized primary care clinics to intervention conditions with allocation probability W = 0.5 and measured outcomes at the participant level; the clinics were nested within health centers, thus exhibiting a three-level structure.Each health center included about n s = 8 clinics, and the average clinic size was m = 63.In this illustrative example, we considered the concern score for falling as a continuous outcome and self-rated health (SRH; measures whether one has good/excellent SRH) as a binary effect modifier measured at the participant level (Gill and others, 2021).We assumed the conditional within-subcluster and between-subcluster outcome-ICCs as α 0 = 0.01 and α 1 = 0.005, and further assumed the within-subcluster covariate-ICC ρ 0 = 0.1.Using Theorem 2.1(b), n c = 10 health centers would be required to detect HTE moderated with the SRH for a standardized effect size δ/σ y|x = 0.2 (interpreted as the impact due to change in SRH on the standard deviation unit change in the concern score).To additionally explore the sensitivity of power to varying covariate-and outcome-ICC values, we present the power contours in Figures 2(c) and (d) as in Example 4.1.In particular, Figure 2(c) confirms that larger values of ρ 0 can decrease the power of the HTE test, but varying ρ 1 has no effect on the test power in a subcluster randomized trial.Varying values of the outcome-ICCs generally do not result in an under-powered HTE test, except when α 0 ∈ [0.01, 0.05] and α 1 /α 0 ≤ 0.4 (a scenario with a moderate within-cluster outcome-ICC but a small between-subcluster outcome-ICC), in which case the power appears just below 80%.Overall, these two examples illustrate how our design expressions can be effectively operationalized in practical design situations and emphasize the critical role of the within-subcluster covariate-ICC in determining the power of the HTE test.

CONCLUDING REMARKS
While prespecified HTE analysis has been a recognized goal in randomized trials, guidance to date on planning pragmatic trials involving clusters with HTE analysis is scarce.Through analytical derivations, we contribute a cascade of new variance expressions to allow for rigorous and yet computationally efficient design of pragmatic CRTs to power confirmatory HTE analysis with effect modifiers measured at each level, addressing a critical methodological gap in designing definitive pragmatic CRTs.Compared to the design of IRTs with HTE analysis (Brookes and others, 2004), the design of CRTs with HTE analysis requires more complex considerations due to the multilevel data structure and additional design parameters, especially the ICCs of the outcome as well as the effect modifier.While recent advancements in computational tools have made the use of a simulation-based power calculation an attractive alternative to analytical power predictions in clustered designs, such an approach is typically time-consuming and can quickly become impractical due to the need to search across multidimensional parameter spaces and repeatedly fitting multilevel models.Our proposed design formulas not only reduce the associated computational issues, but, more importantly, identify key aspects of the data generating processes that contribute to the study power.For example, the power of the HTE test is only affected by the HTE effect size but not the ATE size.Furthermore, the power of the HTE test only depends on the covariate-ICCs defined within each randomization unit but not between randomization units.Regardless of the level of randomization, the variance of the interaction effect estimator from the LM-ANCOVA model is proportional to the ratio of the conditional outcome variance and the variance of the effect modifier.In the context of a binary effect modifier, for example, the variance of the effect modifier is a function of the mean, and it follows that the variance of the interaction effect estimator reaches its minimum when the prevalence of the effect modifier is 0.5 (holding all other factors constant).These insights greatly simplify power analysis by obviating the need to exhaust ancillary design parameters as would otherwise be required in a simulation-based power calculation procedure, and can possibly provide a basis for deriving optimal designs for testing effect modification.
There are additional aspects that we do not address in this article but remain important topics for future research.First, we have assumed equal cluster sizes to arrive at the main results.In a two-level CRT, Tong and others (2021) recently developed a correction factor for the variance formula of the HTE estimator under variable cluster sizes and found that the impact of cluster size variation is minimal with an participant-level effect modifier but can be more substantial with a cluster-level effect modifier.We

Fig. 1 .
Fig. 1.A graphical representation of the data structure in a cluster with two subclusters in a three-level trial.Each oval represents a participant (level 1) nested in a subcluster (level 2) nested in the cluster (level 3).The effect modifier X and outcome Y are measured for each participant with their respective covariate-IC7Cs and outcome-ICCs depicted.A thicker arrow indicates a stronger correlation between variables.
as the set of p ≥ 2 baseline covariates and β 4 = (β 41 , . . ., β 4p ) T .We are interested in testing the global null hypothesis H 0 :β 4 = 0 based on a Wald test.From the LM-ANCOVA model (2.2), the scaled GLS estimator √ n c ( β 4 − β 4) is asymptotically normal with mean zero and variance equal to the lower-right p×p block of(∞,ns,m)  , which is denoted by 4 .This motivates a quadratic Wald test statistic n c β which converges to a Chi-squared distribution χ 2 (p, ϑ) with p degrees of freedom and noncentrality parameter ϑ = n c β T

Fig. 2 .
Fig. 2. Estimated power contours for studying the heterogeneous treatment effect in the HALI and STRIDE trials as a function of the covariate and outcome-ICCs.(a) and (b) are based on the participant-level continuous effect modifier, baseline spelling score, in the HALI CRT (n c = 24, n s = 4, and m = 25), while (c) and (d) are based on the participant-level binary effect modifier, SRH, in the STRIDE subcluster randomized trial (n c = 10, n s = 8, and m = 63).

Table 1 .
Definition of ICC parameters, and a concise summary of the relationships between ICC parameters and the variance of the interaction effect parameter with a univariate participant-level effect modifier (p = 1), under different levels of randomization

Table 3 .
Estimated required number of clusters n c for the covariate-adjusted ATE test based on the proposed formula, empirical type I error (Emp.size), empirical power (Emp.power), as well as predicted power (Pred.power) for the covariate-adjusted ATE test when randomization is at the cluster level.For studying power, the ATE size is fixed at ATE = 0.2