The hidden factor: accounting for covariate effects in power and sample size computation for a binary trait

Abstract

Motivation: Accurate power and sample size estimation is crucial to the design and analysis of genetic association studies. When analyzing a binary trait via logistic regression, important covariates such as age and sex are typically included in the model. However, their effects are rarely properly considered in power or sample size computation during study planning. Unlike when analyzing a continuous trait, the power of association testing between a binary trait and a genetic variant depends, explicitly, on covariate effects, even under the assumption of gene–environment independence. Earlier work recognized this hidden factor, but the implemented methods are not flexible. We thus propose and implement a generalized method for estimating power and sample size for (discovery or replication) association studies of binary traits that (i) accommodates different types of non-genetic covariates E, (ii) deals with different types of G–E relationships, and (iii) is computationally efficient.

Results: Extensive simulation studies show that the proposed method is accurate and computationally efficient for both prospective and retrospective sampling designs with various covariate structures. A proof-of-principle application focused on the understudied African sample in the UK Biobank data. Results show that, in contrast to studying the continuous blood pressure trait, when analyzing the binary hypertension trait, ignoring the covariate effects of age and sex leads to overestimated power and an underestimated replication sample size.

Availability and implementation: The simulated datasets can be found on the online web-page of this manuscript, and the UK Biobank application data can be accessed at https://www.ukbiobank.ac.uk. The R package SPCompute that implements the proposed method is available at CRAN. The genome-wide association studies were carried out using the software PLINK 2.0 (Purcell et al., PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007;81:559–75).


Introduction
Accurate power and sample size estimation is crucial to the design of many scientific studies, including the ubiquitous genome-wide association studies (GWAS) of complex and heritable human diseases and traits (Hong and Park, 2012). It is well known that replication studies with underestimated sample sizes can result in false negatives, missing single nucleotide polymorphisms (SNPs; G's) that are truly associated with the phenotype of interest (Y) (Patil et al., 2016). Additionally, recent work (Turley et al., 2018) has shown that failure to correctly estimate power can also result in increased false positives in pleiotropy studies, where different traits are jointly analyzed and their GWAS summary statistics are aggregated.
The power and sample size calculation for a continuous trait is well established, as the phenotype-genotype association analysis is carried out through ordinary linear regression, regressing Y on G and important non-genetic covariates E. It is then straightforward to show that the power of the corresponding genetic association test depends only on the effect size and minor allele frequency (MAF) of the SNP, the sample size, and the unexplained phenotypic variance (Korte and Farlow, 2013). That is, when analyzing a continuous trait, the sample size for a replication study with sufficient power is determined by the proportion of phenotypic variance explained by genetic variants, which is also called the narrow-sense heritability (Yang et al., 2017; Mayhew and Meyre, 2017).
In contrast, the power calculation for a binary disease outcome requires additional considerations, as the association analysis typically uses a logistic or probit regression model (Robinson and Jewell, 1991; Sjölander and Greenland, 2013). Most heritability estimation methods were rigorously developed for continuous traits only (Weissbrod et al., 2018; Yang et al., 2010), and their application to binary traits has been questioned (Golan et al., 2014). At the same time, when analyzing a binary outcome Y, the power of analyzing a SNP G is affected, explicitly, by the effect size of a non-genetic covariate E, even if E is independent of G and/or there is no G × E interaction effect (Robinson and Jewell, 1991; Pirinen et al., 2012). Therefore, accurate power and sample size estimation for a binary trait-genetic association analysis must explicitly consider the presence of non-genetic covariates.
There have been several attempts in the literature to address the general problem of power and sample size computation for logistic regression. Whittemore (1981) derived an approximation method, assuming that the disease prevalence is small and that the covariates jointly follow a multivariate exponential distribution. The approach of Whittemore (1981) was similarly considered by Hsieh (1989), Hsieh et al. (1998), and Novikov et al. (2010). Based on the asymptotic power approximation of the score or likelihood ratio test under local alternatives, Self et al. (1992) and Self and Mauritsen (1988) proposed an alternative approach that accommodates several categorical covariates with finite configurations, which was then extended by Shieh (2000) to allow for one categorical covariate with infinite configurations.
For genetic association studies, Quanto is the most commonly used software in practice, implementing the method of Gauderman (2002a,b). The method uses the expected value of a likelihood ratio test (LRT) statistic and accommodates both continuous and categorical E's for the power analysis of G × E interaction. However, the approach of Gauderman (2002b) implicitly assumes that G and E are independent of each other, which may not hold in practice for complex diseases (Plomin et al., 1977; Scarr and McCartney, 1983; Knafo and Jaffee, 2013; Zhu et al., 2018; Namjou et al., 2019). Further, the implemented software Quanto does not accommodate the presence of E unless the power computation is for the G × E interaction analysis; that is, E cannot be included when the power analysis is for the main effect of G. Demidenko (2007), on the other hand, advocated the use of the Wald test for power and sample size computation in logistic regression, and proposed a method that allows E and G to be dependent through a second-stage logistic regression model. However, the implemented web-tool (Demidenko, 2008) only allows for one binary covariate, as otherwise the computation does not admit a closed-form expression. Lyles et al. (2007) proposed a different approach to power computation for generalized linear models, based on the use of an expanded representative dataset. The idea of an expanded representative dataset provides accurate approximation with good computational efficiency when the sample size is small to medium, but the computation becomes cumbersome when the sample size is large, while a large sample size (and a small genetic effect size) is a feature of many GWAS.
In this paper, we propose and implement a generalized method for estimating power and sample size for genetic association studies of binary traits that (a) takes into account different types of non-genetic covariates E, (b) allows for different types of G-E relationships, and (c) has good computational efficiency for large-scale studies. The utility of the proposed method is illustrated and compared with the existing methods through extensive simulation studies and an application to the UK Biobank data (Sudlow et al., 2015; Bycroft et al., 2017). The proposed method has been implemented as an R package, SPCompute, available at CRAN.

Models
For simplicity of notation, we assume without loss of generality that there is only one non-genetic covariate E; the method implementation and application allow for multiple E's. To study the relationship between a trait Y and a SNP G of interest, conditional on the non-genetic covariate E, we consider the following generalized linear model (GLM; McCullagh and Nelder, 2019),

g(µ) = η, (1)

where g(·) is a link function connecting the linear predictor η with the mean function µ of Y. This GLM accommodates the analyses of both continuous and binary traits. Here we focus on binary traits, for which logistic regression is the most commonly used model, with g(µ) = log(µ/(1 − µ)).
Let X be the design matrix, with rows x_i = (1, G_i, E_i), i = 1, ..., n, where n is the sample size. To ease notation, we also use X to denote the observed data, and we use β = (β_0, β_G, β_E)^T to denote the vector of either all regression parameters or their true values. The linear predictor η is then expressed as

η = β_0 + β_G G + β_E E.

Following the convention in genetic association studies, the SNP genotypes aa, Aa and AA are assumed to follow the Hardy-Weinberg equilibrium (HWE; Mayo, 2008), where A is the minor allele with MAF of p. Also by convention, G is coded additively, tracking the number of copies of allele A. Thus,

P(G = 0) = (1 − p)^2, P(G = 1) = 2p(1 − p), P(G = 2) = p^2.
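As a quick illustration, the HWE genotype distribution implied by a given MAF can be computed directly. The following minimal Python sketch is ours for illustration only (the function name is not part of SPCompute); it returns P(G = 0), P(G = 1) and P(G = 2) for the additive coding above.

```python
# Genotype distribution of an additively coded SNP under Hardy-Weinberg
# equilibrium, given minor allele frequency p. Illustrative sketch only.

def hwe_genotype_probs(p):
    """Return (P(G = 0), P(G = 1), P(G = 2)) under HWE with MAF p."""
    return ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)

probs = hwe_genotype_probs(0.1)
# Under additive coding, E[G] = 2p.
mean_G = sum(g * pr for g, pr in zip((0, 1, 2), probs))
```

For instance, with p = 0.1 the three genotype probabilities are 0.81, 0.18 and 0.01, and E[G] = 2p = 0.2.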

The Wald test
To test the association between SNP G and trait Y, i.e. H_0: β_G = 0 vs. H_1: β_G ≠ 0, one can consider different tests such as the LRT, score or Wald tests. These tests have similar asymptotic behaviors under the null hypothesis, and they are locally equivalent (Rao et al., 1973; Serfling, 2009). However, as noted by Demidenko (2007), these three likelihood-based tests differ globally. As Wald tests are routinely used in GWAS, following the argument of Demidenko (2007), we carry out our power and sample size computation based on the Wald test.
The Wald test statistic in our setting is expressed as

T = β̂_G^2 / I_X^{-1}(β̂)_[2,2],

where I_X^{-1}(β̂)_[2,2] denotes the second diagonal element of the matrix I_X^{-1}(β̂), and β̂ is the maximum likelihood estimate (MLE) of β. I_X(β) is the observed or conditional Fisher information matrix, conditional on the observed X, defined as

I_X(β) = X^T W(β) X,

where W(β) is an n × n diagonal matrix, with the i-th diagonal element

w_i = (∂µ_i/∂η_i)^2 / Var(Y_i).

Under the null hypothesis, T is asymptotically χ²_1 distributed. Using the above expression of w_i, it is easy to see that when analyzing a continuous trait with residual variance σ² via linear regression (i.e. using the identity link function), ∂µ_i/∂η_i = 1 and w_i = 1/σ², which means I_X(β) depends on σ² but not on any regression coefficients explicitly. In contrast, when analyzing a binary trait using logistic regression,

w_i = µ_i(1 − µ_i),

which is a function of both β_G and β_E. Thus, the size of the non-genetic covariate effect β_E explicitly influences the Fisher information matrix, and hence the power analysis of β_G.
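The observed Fisher information can be assembled directly from its definition I_X(β) = X^T W(β) X with w_i = µ_i(1 − µ_i) for the logistic model. The Python sketch below is illustrative only (it is not the SPCompute implementation), using a tiny hypothetical design matrix.

```python
import math

# Sketch: observed Fisher information I_X(beta) = X^T W(beta) X for logistic
# regression, where the i-th diagonal weight is w_i = mu_i * (1 - mu_i).

def fisher_information(X, beta):
    """X: list of rows (1, G, E); beta: (b0, bG, bE). Returns a 3x3 matrix."""
    p = len(beta)
    I = [[0.0] * p for _ in range(p)]
    for x in X:
        eta = sum(b * xi for b, xi in zip(beta, x))  # linear predictor
        mu = 1.0 / (1.0 + math.exp(-eta))            # P(Y = 1 | x)
        w = mu * (1.0 - mu)                          # logistic weight w_i
        for j in range(p):
            for k in range(p):
                I[j][k] += w * x[j] * x[k]
    return I

# Toy design matrix with rows (1, G, E); values are illustrative.
X = [(1, 0, 0.5), (1, 1, -0.2), (1, 2, 1.1)]
I = fisher_information(X, (-2.0, 0.4, 0.9))
```

Note how changing β_E changes every µ_i, and hence every entry of I_X(β), including the element that drives the variance of β̂_G.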

The hidden factor in power and sample size computation
Assume the significance level of the test is α, and the sample size is large enough that the asymptotic distribution of the Wald test statistic can be used. Let V_{G,X} := I_X^{-1}(β)_[2,2] be the variance of β̂_G; the power of the Wald test can then be computed as

Power = Φ(β_G/√V_{G,X} − Z_{1−α/2}) + Φ(−β_G/√V_{G,X} − Z_{1−α/2}),

where Z_{1−α/2} denotes the 1 − α/2 quantile of the standard normal distribution and Φ its cumulative distribution function. Worth re-emphasizing is the fact that V_{G,X}, and thus the power of logistic regression, explicitly depends on both β_G and β_E, as discussed in Section 2.2 above.
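The two-sided Wald power described above can be evaluated in a few lines. The sketch below is a hedged illustration (not SPCompute code): given the variance V of β̂_G, the effect size β_G, and the level α, it returns the power.

```python
from statistics import NormalDist

# Sketch: two-sided Wald power. With non-centrality b / sqrt(V), power at
# level alpha is Phi(b/sqrt(V) - z) + Phi(-b/sqrt(V) - z),
# where z is the 1 - alpha/2 standard normal quantile.

def wald_power(beta_G, V, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ncp = beta_G / V ** 0.5          # non-centrality of the Wald statistic
    return NormalDist().cdf(ncp - z) + NormalDist().cdf(-ncp - z)
```

As a sanity check, under the null (β_G = 0) the expression reduces to the type I error α, and the power increases as V shrinks or |β_G| grows.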
The above power computation is for conditional power, conditional on the observed X. However, for sample size determination for a successful replication study, the corresponding power analysis is performed prior to observing any data. In that case, the power is referred to as the unconditional power (Lyles et al., 2007).
To compute the unconditional power, we naturally replace the conditional Fisher information matrix I_X(β) above with its unconditional version I_n(β). Let I_1(β) denote the unconditional Fisher information for a single observation x = (1, G, E)^T. For the logistic regression considered,

I_1(β) = E_{F_X}[ µ(x){1 − µ(x)} x x^T ], (2)

where the expectation is taken over F_X, the distribution of the covariate space X (i.e. of both G and E).
Once I_1(β) has been computed for a given F_X, the unconditional Fisher information matrix for a random sample of size n is

I_n(β) = n I_1(β),

and the unconditional power,

Power = Φ(β_G/√V_{G,n} − Z_{1−α/2}) + Φ(−β_G/√V_{G,n} − Z_{1−α/2}), (3)

where V_{G,n} := I_n^{-1}(β)_[2,2] is based on the unconditional Fisher information matrix I_n(β).
To plan a successful replication study at the α level, the sample size n required to achieve a desirable power can be computed by simply inverting the power function, which is monotonic with respect to n. Although the sample size computation is for a specific genetic effect β_G, it is clear that, similar to the conditional Fisher information, the unconditional Fisher information in (2), and therefore V_{G,n} in (3), also depends on β_E. Thus, this hidden factor must be explicitly accounted for when performing sample size calculations for a binary trait.
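Since I_n(β) = n I_1(β), the variance of β̂_G scales as v_1/n, where v_1 is the per-observation variance, and the monotonicity of power in n allows the required sample size to be found by a simple search. The Python sketch below illustrates this inversion; the values v_1 = 8.0 and β_G = 0.2 are arbitrary illustrative choices, not from the paper.

```python
from statistics import NormalDist

# Sketch: invert the unconditional power function for n, using
# V_{G,n} = v1 / n with v1 the per-observation variance of beta_G-hat.

def wald_power(beta_G, V, alpha):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ncp = beta_G / V ** 0.5
    return NormalDist().cdf(ncp - z) + NormalDist().cdf(-ncp - z)

def required_n(beta_G, v1, target=0.80, alpha=0.05):
    # Power is monotone increasing in n, so a simple upward search suffices;
    # a bisection or the closed-form normal approximation would also work.
    n = 1
    while wald_power(beta_G, v1 / n, alpha) < target:
        n += 1
    return n

n = required_n(beta_G=0.2, v1=8.0)
```

The returned n is the smallest sample size whose power reaches the target; note that v_1, and hence n, depends on β_E through I_1(β).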

Designing the covariate space
To compute the unconditional Fisher information matrix I_n(β), one needs to compute the moments and covariance of a random sample pair (G_i, E_i) from the corresponding covariate space F_X. An appropriately designed covariate space F_X should be flexible enough to accommodate a potentially complex dependence structure between G and E, while remaining conceptually simple enough that practitioners can make use of their domain knowledge.
In the work of Gauderman (2002b), the author implicitly assumed independence between G and E, requiring only the marginal distributions of G and E. Although this makes the method easy to implement, the assumption may not hold in practice (Plomin et al., 1977; Scarr and McCartney, 1983; Knafo and Jaffee, 2013; Zhu et al., 2018; Namjou et al., 2019). Furthermore, the implemented software Quanto only allows users to specify E when the target analysis is the G × E interaction effect, not the main effect of G.
The work of Demidenko (2007), on the other hand, allows F_X to accommodate dependence between a binary G and a binary E, by introducing a second-stage logistic regression,

logit{P(G = 1 | E)} = γ*_0 + γ*_E E, (4)

where γ*_0 is determined by user-specified marginal probabilities of G and E. As a result, users only need to additionally input knowledge about γ*_E to fully specify F_X. However, the method of Demidenko (2007) is designed for a binary G (hence the typical GWAS additive coding of G is not applicable) and a binary E, and its generalization to different types of G and E is nontrivial.
Here, we utilize the idea of a second-stage regression of Demidenko (2007) but extend it to a more general setting. Instead of treating E as a covariate in the second-stage regression, we consider it to be the response variable, such that

g_2(E(E | G)) = γ_0 + γ_G G, (5)

where g_2 is the link function, being the identity when E is continuous and the logit when E is binary.
Compared with the second-stage regression model in (4), the proposed method can accommodate different types of E in a unified framework. When E is continuous, the regression model also requires Var(E | G) in order to be fully specified. The value of Var(E | G) can be computed based on user-provided information such as µ_E, σ_E and p.
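Under this second-stage regression, drawing from F_X amounts to first drawing G under HWE and then drawing E | G from the specified g_2 model. The following Python sketch illustrates the two-step draw; the parameter values (γ_0, γ_G, σ) are placeholders, not values from the paper.

```python
import math
import random

# Sketch: draw a (G, E) pair from F_X via the second-stage regression E | G.
# G follows HWE with MAF p; E | G is N(gamma0 + gammaG * G, sigma^2) when
# continuous, or Bernoulli(logit^{-1}(gamma0 + gammaG * G)) when binary.

def draw_covariates(p, gamma0, gammaG, sigma=1.0, binary_E=False, rng=random):
    # Step 1: genotype G in {0, 1, 2} with HWE probabilities.
    probs = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
    u = rng.random()
    G = 0 if u < probs[0] else (1 if u < probs[0] + probs[1] else 2)
    # Step 2: E | G through the second-stage linear predictor.
    eta = gamma0 + gammaG * G
    if binary_E:
        E = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0
    else:
        E = rng.gauss(eta, sigma)
    return G, E

random.seed(1)
pairs = [draw_covariates(0.1, 0.0, math.log(0.5)) for _ in range(1000)]
```

A nonzero γ_G induces dependence between G and E, which the independence-based approach of Gauderman (2002b) cannot represent.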
We note that the proposed method can account for multiple E's by specifying the corresponding g_2 function for each E considered. Later, we will demonstrate the utility of our approach in a UK Biobank data application of hypertension, for which both age and sex are important covariates to consider for power and sample size computation. In the rest of this section, we assume there is only one covariate E for clarity of presentation.
3.2 Proposed method 1: Semi-Simulation (P1.SS)

The estimation of the unconditional power depends heavily on the computation of I_n(β) := n I_1(β). Unfortunately, except in special cases such as when both G and E are binary, I_1(β) in (2) does not have a closed-form expression for a general F_X (Demidenko, 2007). Thus, to estimate I_n(β), we propose to use a sample estimate.
Specifically, for a large integer B, we simulate independent observations {G_i, E_i}_{i=1}^B from the covariate space F_X, and for each x_i = (1, G_i, E_i)^T we compute the corresponding conditional Fisher information matrix,

I_{x_i}(β) = w_i x_i x_i^T, where w_i = µ_i(1 − µ_i).

By a simple application of the law of large numbers, the sample estimate,

Ĩ_n(β) := (n/B) Σ_{i=1}^B I_{x_i}(β), (6)

converges almost surely to the true expected matrix I_n(β) as B grows.
As we will illustrate later in the simulation studies, Ĩ_n(β) exhibits little variation for large B (e.g. > 10,000). Furthermore, the proposed semi-simulation method is scalable, as for each I_{x_i}(β) we only compute the observed Fisher information matrix for a single observation. Thus, the computational load depends on B but is independent of the target sample size n. Once I_n(β) is replaced by Ĩ_n(β), the power computation can proceed using equation (3), and sample size estimation by inverting the power function.
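The semi-simulation idea can be sketched as follows: average B single-observation Fisher information matrices over draws from F_X, then scale by the target n. This is an illustrative Python re-implementation, not the SPCompute code; the covariate-drawing function and all parameter values are toy stand-ins for a user-specified F_X.

```python
import math
import random

# Sketch of the semi-simulation (P1.SS) estimator: I_n(beta) is estimated by
# (n/B) * sum of B single-observation Fisher information matrices.

def single_obs_info(x, beta):
    eta = sum(b * xi for b, xi in zip(beta, x))
    mu = 1.0 / (1.0 + math.exp(-eta))
    w = mu * (1.0 - mu)                       # logistic weight
    return [[w * xi * xj for xj in x] for xi in x]

def semi_sim_info(beta, draw_x, n, B=10_000):
    p = len(beta)
    acc = [[0.0] * p for _ in range(p)]
    for _ in range(B):
        Ii = single_obs_info(draw_x(), beta)
        for j in range(p):
            for k in range(p):
                acc[j][k] += Ii[j][k]
    # Average over the B draws, then scale to the target sample size n;
    # note the cost depends on B, not on n.
    return [[n * v / B for v in row] for row in acc]

def draw():
    # Toy F_X: G from HWE with MAF 0.1, then E | G normal (values illustrative).
    u = random.random()
    G = 0.0 if u < 0.81 else (1.0 if u < 0.99 else 2.0)
    E = random.gauss(0.3 * G, 1.0)
    return (1.0, G, E)

random.seed(42)
I_n = semi_sim_info((-2.0, 0.4, 0.9), draw, n=5000, B=2000)
```

Inverting the resulting 3 × 3 matrix and taking the second diagonal element gives the plug-in estimate of V_{G,n} for the power formula.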
3.3 Proposed method 2: Representative Dataset (P2.RD)

An alternative method that does not rely on plugging in the sample estimate of I_n(β) is through the use of a representative dataset, an idea originally suggested by O'Brien (1986) and later extended by Lyles et al. (2007).
In our setting, given a sample size n, assume there exists a representative covariate sample {x_i}_{i=1}^n from the covariate space F_X, which we define later. We then expand {x_i}_{i=1}^n to consider both possible outcomes of the binary trait, so that each observation x_i splits into {x_i, y_i = 0} and {x_i, y_i = 1}. Additionally, each expanded observation is given a weight, so that δ_i^0 + δ_i^1 = 1, where

δ_i^l = P(Y_i = l | x_i; β), (7)

for l = 0 and 1.
Thus, the original representative dataset {x_i}_{i=1}^n is expanded into the following representative dataset, which has 2n (weighted) observations:

RD = { (x_i, y_i = 0, δ_i^0), (x_i, y_i = 1, δ_i^1) }_{i=1}^n. (8)

The standard ML estimation of V_{G,n}, derived from the corresponding weighted log-likelihood, yields an estimate Ṽ_{G,n}, which can be directly plugged into equation (3) to complete the power computation (Lyles et al., 2007).
It remains to be discussed what a representative {x_i}_{i=1}^n is and how the expanded representative dataset RD can be obtained in our study setting. In the case of conditional power analysis where covariates are already observed, the observed {x_i}_{i=1}^n can be directly used in (7) to establish the representative dataset of (8). For the unconditional power analysis, {x_i}_{i=1}^n can be obtained by using the user-provided F_X. Lyles et al. (2007) provided examples of how to define the notion of a representative dataset for different types of F_X. We follow the procedures of Lyles et al. (2007) for the types of F_X considered in Section 3.1.
When E is binary and the link function in (5) is the logit, we can compute the expected counts for each category {(G = i, E = j)}, i = 0, 1, 2 and j = 0, 1, as n_{i,j} = nP(G = i, E = j), using the available information such as the MAF and the inheritance mode, with appropriate rounding to ensure that the n_{i,j}'s are integers and sum to n.
When E is continuous and the link function is the identity with Var(E | G) = σ²_E, we first categorize the dataset based on G such that n_i = nP(G = i), i = 0, 1, 2. Then, for each of the j = 1, ..., n_i observations with G = i, the covariate value is set to the conditional quantile

E_{ij} = γ_0 + γ_G i + σ_E Φ^{-1}(j/(n_i + 1)),

where Φ^{-1} is the inverse of the cumulative distribution function of the standard normal.
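The expansion step of P2.RD can be sketched in a few lines: each representative covariate row is duplicated for y = 0 and y = 1 and weighted by the corresponding model probabilities. The Python below is an illustration under assumed parameter values, not the SPCompute implementation.

```python
import math

# Sketch of the P2.RD expansion: row x_i splits into (x_i, y = 0) with weight
# 1 - mu_i and (x_i, y = 1) with weight mu_i, where mu_i = P(Y = 1 | x_i).

def expand_representative(X, beta):
    expanded = []
    for x in X:
        eta = sum(b * xi for b, xi in zip(beta, x))
        mu = 1.0 / (1.0 + math.exp(-eta))
        expanded.append((x, 0, 1.0 - mu))   # (covariates, y, weight delta^0)
        expanded.append((x, 1, mu))         # (covariates, y, weight delta^1)
    return expanded

# Toy representative covariate rows (1, G, E); values are illustrative.
X = [(1, 0, 0.3), (1, 1, -1.2), (1, 2, 0.8)]
RD = expand_representative(X, (-2.0, 0.4, 0.9))
```

Fitting the weighted logistic log-likelihood on RD then yields the variance estimate that is plugged into the power formula, with cost growing with n rather than with a simulation size B.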

Overview of the simulation design
We compared the power and sample size computed using the proposed P1.SS and P2.RD methods (implemented as the R package SPCompute available at CRAN) with those computed using Quanto of Gauderman (2002b) (version 1.2.4 downloaded from http://hydra.usc.edu/GxE), and the method of Demidenko (2007) using its web-platform (dartmouth.edu/∼eugened/power-samplesize.php).
We considered three different scenarios for F_X: no covariate E (as Quanto does not allow for E), a binary E (as the method of Demidenko (2007) only allows for a binary E), and a continuous E. Although we only considered one E in the simulation studies for method comparison, our implemented SPCompute R package allows for multiple E's, as demonstrated in our UK Biobank application study in Section 5. Finally, for the simulation studies we also considered three study designs, where S1 is case-control retrospective (Quanto only allows for the case-control study design), while S2 and S3 are prospective, reflecting the design of emerging biobank-sized data such as the UK Biobank data used in our application.
The accuracy of each method was assessed by comparing the computed power (and sample size) with the empirical values obtained through 1,000 independent replications. Given the large number of replications, the empirical values were treated as the oracle values and used as the benchmark. We calculated the average and the maximum of the absolute error (AE) of each computed power (and sample size) as compared with the oracle values; the more accurate method is expected to have smaller mean AE and max AE. By convention, for replication sample size computation, the desired power was set to 80% at the significance level of 0.05, and for consistency, the same significance level was used for power computation.

Scenario 1: No covariate E with a case-control retrospective study design
The choice of no covariate effect was to accommodate the implementation of Quanto of Gauderman (2002b). Without loss of generality, the disease prevalence was assumed to be 20%, and the observations were obtained independently with a retrospective sampling design and the standard case-to-control ratio of 1-to-1.
The associated SNP has a MAF of 0.1, with a dominant effect β_G ranging from log(1.1) to log(2.5). That is, the odds ratio (OR) of the associated SNP ranged from 1.1 to 2.5. The choice of a dominant genetic effect was to accommodate the implementation of the method of Demidenko (2007), which only allows for a binary G. Finally, we used β_0 = −2, though we note that the intercept parameter does not affect the power of a case-control study.
4.1.2 Scenario 2: Binary E with a prospective study design

Similar to S1 above, in the second scenario the disease prevalence is also 20%, and the associated SNP with MAF of 0.1 has a dominant effect β_G ranging from log(1.1) to log(2.5). However, the observations were obtained independently with a prospective sampling design, as in the UK Biobank data. Additionally, the non-genetic covariate E has a population exposure rate of P(E = 1) = 0.3 with effect β_E = log(2.5). Finally, γ_G = log(0.2), quantifying the dependency between G and E as defined in (5).
To implement Quanto of Gauderman (2002b), which only allows for a case-control study design, we used a case-to-control ratio of 1-to-4 to approximate the result for a disease with a prevalence of 20%. Additionally, when the power analysis concerns the G main effect (as opposed to the G × E interaction effect), Quanto does not consider the presence of E; thus, we only input the information about G in the implementation of Quanto. The method of Demidenko (2007) accommodates the presence of one binary covariate E for the power (and sample size) computation; G must be binary, hence the dominant genetic model was assumed for the method implementation.

Scenario 3: Continuous E with a prospective study design
For this last scenario, without loss of generality, the covariate E was assumed to follow the standard normal distribution conditional on G. The dependency between G and E was set to γ_G = log(0.5). All other model specifications are the same as in S2 above, including the disease prevalence (20%), MAF (0.1), the genetic effect (ranging from log(1.1) to log(2.5)), and the non-genetic covariate effect (log(2.5)).
As in the previous scenario, we ignored the information on E for the implementation of Quanto. For the method of Demidenko (2007), which only allows for a binary E, we considered two approaches. We first omitted the continuous covariate E entirely (corresponding results marked with *), and we then dichotomized E by defining Ẽ := I(E > 0). This corresponds to creating two misspecified models,

logit{P(Y = 1 | G, Ẽ)} = β̃_0 + β̃_G G + β̃_E Ẽ and logit{P(Ẽ = 1 | G)} = γ̃_0 + γ̃_G G.

As the parameter values specified for the true models (1) and (5) cannot be directly used for the two misspecified models, we used estimated γ̃_G and β̃_E. We first simulated a large number of observations {G_i, E_i, Y_i}_{i=1}^{3×10^5} using the true model. We then dichotomized the continuous E to obtain Ẽ as specified above. Finally, we regressed Y on G and Ẽ, and Ẽ on G, to obtain sample estimates of β̃_E and γ̃_G for the second implementation of the method of Demidenko (2007).

Methods comparison across the three scenarios
Figure 1 shows the computed power and sample size curves across the three scenarios, and the empirical results are consistent with our analytical expectation: ignoring covariate effects at the (replication) study planning stage can lead to overestimated power and an underestimated replication sample size when studying a binary trait.
In the left panel of Figure 1, the OR of the associated SNP was fixed at 1.5 in (a, c) and 1.3 in (e), and Figures 1(a), 1(c) and 1(e) show the estimated power for different sample sizes at the significance level of 0.05. The results clearly show that, if there are no covariates, all methods perform well (Figure 1(a)). In the presence of a binary covariate, Quanto tends to overestimate the power of an association study, while both the method of Demidenko (2007) and the two proposed methods provide power estimates close to the oracle values (Figure 1(c)). However, if the influential covariate is continuous (e.g. age, as in our UK Biobank application for hypertension), only the two proposed methods (P1.SS and P2.RD) perform well (Figure 1(e)). The superior performance of the two proposed methods is also shown in Table 1, as P1.SS and P2.RD have smaller average and maximum absolute errors as compared with the true power.
In the right panel of Figure 1, Figures 1(b), 1(d) and 1(f) show the estimated sample sizes, necessary to achieve 80% power at the 0.05 level, to successfully replicate an associated SNP with OR ranging from 1.1 to 2.5. A similar conclusion can be drawn here, as the existing methods tend to underestimate the necessary sample size for a successful replication study in the presence of an influential covariate, while the proposed P1.SS and P2.RD methods are accurate.

Choice between the proposed P1.SS and P2.RD methods from the computational perspective
To select between the two proposed methods, P1.SS and P2.RD, as described in Section 3.2 and Section 3.3 respectively, here we study the factors influencing the computational efficiency of the two methods and make recommendations to practitioners.
Conceptually, the computational efficiency of the semi-simulation P1.SS method depends on B, the number of independent observations drawn from F_X to obtain Ĩ_n(β) in (6). As Ĩ_n(β) is based on B replicates of the one-sample I_{x_i}(β), the target sample size n does not have a direct impact on the computational time. In contrast, the computational time of the P2.RD method depends on n, as the method first creates a representative dataset of size n from F_X and then expands it into 2n weighted observations as in (8).
To numerically demonstrate the computational properties of the two methods, without loss of generality, we considered simulation scenario S2 in Section 4.1.2 and used β_G = log(1.5) for illustration. Results in Figure 2(a) confirm our analytical expectation: the computational time of P2.RD grows linearly with respect to n, while that of P1.SS is independent of n.
However, the accuracy of P1.SS depends on a large B; we used B = 10,000 (log_10(B) = 4) in Figure 2(a). Figure 2(b) shows the stability of P1.SS with respect to log_10(B): for each choice of log_10(B) from 3.0 to 4.3, the log_10 sample standard error (SE) of the power computed by P1.SS, obtained from 1,000 independently simulated replicates, is shown. Results clearly show that B ≥ 10,000 (log_10(B) ≥ 4) leads to a negligible SE of less than 0.01 (log_10(SE) < −2). Thus, the default value for B in our method implementation is 10,000. Interestingly, the relationship between B and SE appears to be approximately log-log linear.
We note that the Y-axis in Figure 2(a) measures the run time in seconds for the computation of one set of parameter values (i.e. per computation). In practice, it is often necessary to run power and sample size analyses over a fine grid of a large number of possible parameter values. Thus, the run-time difference accumulates and can differ substantially between P1.SS and P2.RD. In general, when n is larger than 25,000, P1.SS is preferred over P2.RD. Thus, although the implemented software SPCompute includes both methods, it sets P1.SS as the default method.

Application to the UK Biobank data
To illustrate the practical utility of the proposed power and sample size computation methods, we applied them to the UK Biobank data (Sudlow et al., 2015; Bycroft et al., 2017), focusing on the understudied participants with African ancestry. Without loss of generality, we chose hypertension as the binary trait of interest. For completeness, we also analyzed (diastolic) blood pressure, a continuous trait, to contrast with the binary trait.

Sample and SNP data quality control
We started with the 3,460 participants with self-reported African ancestry. First, we followed standard practice (Marees et al., 2018) to filter out individuals with genotype missingness higher than 20 percent. To remove related individuals, we then filtered out individuals with a kinship coefficient larger than 0.25, which resulted in 3,182 (approximately) unrelated self-reported Africans.
To account for reporting error and other ancestry biases, we then performed principal component analysis (PCA) using the overall principal components (PCs) provided by UKB (Data-Field 22009). Figure 3(a) shows the first two PCs of all UKB samples, stratified by self-reported Africans and Others, which suggests reporting error. We then applied the K-means algorithm with K = 4 (Hartigan and Wong, 1979) to the 3,182 unrelated self-reported Africans (Figure 3(b-d)) using the provided overall PCs. Cluster 1 contains 2,601 individuals, 75% of all self-reported African participants (Figure 3(b-d)). Following common practice, we also computed new PCs using only the 3,182 individuals (Supplementary Figure S1). Among the 2,601 individuals in Cluster 1, we then removed 91 individuals whose new PCs were more than four standard deviations away from the mean. Thus, the final GWAS sample consists of n = 2,510 (1,232 females and 1,278 males) unrelated individuals with PC-confirmed African ancestry.
For the genetic data, we started with 784,256 genotyped autosomal SNPs (Data-Field 22418), and we then filtered out SNPs based on the thresholds of HWE p-value < 1e-10, MAF < 0.01, and missing rate > 0.2. The X chromosome was not included in our analysis, due to the recent report of a previously unrecognized data quality issue with the X chromosome (Wang et al., 2022). In total, 379,003 common, good-quality autosomal SNPs were selected for the subsequent analyses.

GWAS of hypertension and blood pressure
We considered two phenotypes, one binary (hypertension) and the other continuous (diastolic blood pressure). In this application, we only considered their measurements at the initial assessment, as longitudinal data analysis is beyond the scope of this work. There are two measurements of blood pressure during the initial assessment, and we used their average. Additionally, for blood pressure we used the automated reading instead of the manual reading. Among the n = 2,510 analyzed individuals, the prevalence of hypertension is 39.48 percent, and the diastolic blood pressure has mean 85.35 and standard deviation 10.75.
The GWAS of both traits included age and sex as covariates. The analysis of the binary hypertension trait used logistic regression, and the analysis of the continuous trait used linear regression, as is conventional. The two GWAS results are displayed in Figure 4(a) and (b) for hypertension and blood pressure, respectively. Given the small sample size, it is not surprising that none of the SNPs reached the genome-wide significance level of 5e-8 (Dudbridge and Gusnanto, 2008).

Power and sample size computation
To illustrate the importance of including covariates in power and sample size computation for a binary trait, as a proof of principle, we focused on the top five ranked SNPs from each GWAS. The effect size estimates of both the SNPs and the covariates were used for the corresponding power and sample size computation. Although we recognize the potential issue of the winner's curse (Sun et al., 2011), this issue does not qualitatively change the conclusions drawn from the method comparison.
To illustrate the important role of age and sex in study planning for the binary trait of hypertension, we computed power and the required replication sample size twice. We first ignored the two covariates, as is commonly done (without E), which is equivalent to using Quanto; the method of Demidenko (2007) is not applicable as there are two covariates. We then accounted for the covariate effects (with E) using the proposed method P1.SS, the default method implemented in SPCompute. Finally, in addition to α = 0.05, the standard significance level used for planning a successful replication study, we also considered 5e-8, the genome-wide significance level, to demonstrate the power and sample size needed for the discovery GWAS to achieve 80% power. For comparison, we also analyzed the continuous trait of blood pressure, where the covariate effects are implicitly incorporated through the specification of the residual variance.
For each trait analyzed, the computed power and sample size for the top SNPs are shown in Figure 5, and the results are consistent with our simulation studies. For the binary hypertension trait (left panel of Figure 5), the taller blue bars in Figure 5(a) show that ignoring covariate effects leads to overestimated power of the discovery study (at α = 5e-8); the power at α = 0.05 is close to 100% as expected, and thus uninteresting and not shown. The shorter blue bars in Figure 5(c) show that ignoring covariate effects leads to an underestimated discovery sample size (for 80% power at α = 5e-8). Similarly, the shorter blue bars in Figure 5(e) show that ignoring covariate effects leads to an underestimated replication sample size (for 80% power at α = 0.05) when planning a replication study of a binary trait. In contrast, when studying the continuous blood pressure trait, age and sex effects are already implicitly incorporated through the residual variance. Thus, the covariate effects βE's do not have to be explicitly included in the power and sample size computation for a continuous trait.
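The qualitative effect can be reproduced with a small numeric sketch: approximating the expected Fisher information of the logistic model by simulating only the covariates (G, E), in the spirit of the semi-simulation idea. This is a simplified stand-in, not the paper's exact P1.SS algorithm, and all parameter values below are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)

def wald_power(beta_g, beta_e, beta_0, n, maf=0.3, alpha=5e-8, B=100_000):
    """Approximate power of the two-sided Wald test for beta_g in
    logit P(Y=1) = beta_0 + beta_g*G + beta_e*E, with G ~ Binomial(2, maf)
    (HWE) independent of E ~ N(0, 1). The expected Fisher information is
    approximated by simulating covariates only."""
    g = rng.binomial(2, maf, B).astype(float)
    e = rng.normal(0.0, 1.0, B)
    p = 1.0 / (1.0 + np.exp(-(beta_0 + beta_g * g + beta_e * e)))
    w = p * (1.0 - p)  # logistic variance weights
    X = np.column_stack([np.ones(B), g, e])
    info = (X * w[:, None]).T @ X / B  # per-subject expected information
    se = np.sqrt(np.linalg.inv(info)[1, 1] / n)
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    ncp = beta_g / se
    return stats.norm.cdf(ncp - z) + stats.norm.cdf(-ncp - z)

# Assumed parameter values for illustration only
p_no_e = wald_power(np.log(1.5), beta_e=0.0, beta_0=-1.0, n=2000)
p_with_e = wald_power(np.log(1.5), beta_e=2.0, beta_0=-1.0, n=2000)
# Ignoring a strong covariate effect overstates the power of the SNP test
```

Because the covariate effect enters the weights w = p(1-p) of the information matrix, a large βE pushes fitted probabilities toward 0 or 1, inflating the standard error of the βG estimate and lowering power; no analogous term appears in the linear-model power formula, where residual variance absorbs the covariates.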

Discussion
Adjusting for covariate effects is standard practice in GWAS, but it is rarely done at the study planning stage or in replication studies. When analyzing a continuous trait through linear regression, covariate effects are implicitly accounted for through the residual variance. However, when analyzing a binary trait through logistic regression, covariate effects (βE's) must be explicitly specified and included in power and sample size computation, in addition to the genetic effect of interest (βG). This phenomenon is known in the statistics literature, but the tools available to practitioners are limited. For example, the most well-known software, Quanto, does not consider βE unless the power analysis is for a GxE interaction analysis (Gauderman, 2002b,a), while the method of Demidenko (2007) only allows for one binary E. In this work, we developed and implemented a flexible software package, SPCompute, for accurate and efficient power and sample size computation for a binary trait. We applied the proposed method to the UK Biobank data, analyzing the binary hypertension trait while simultaneously accounting for the age and sex covariate effects in power and sample size computation. We also conducted extensive simulation studies to demonstrate the accuracy and efficiency of the proposed method.
However, several limitations of the proposed method remain and require future work. For example, winner's curse, whereby the effect size estimates of significant SNPs are biased upward, is a well-known problem in GWAS (Sun and Bull, 2005; Zöllner and Pritchard, 2007; Zhong and Prentice, 2010; Sun et al., 2011). It would therefore be of interest to investigate how SPCompute can account for the winner's curse. Another direction of extension would be to account for misclassification (particularly of the control data), which affects the power of an association study (Rekaya et al., 2016; Zhang and Yi, 2020; Lin et al., 2020). Additionally, the proposed framework can be further generalized to accommodate the simultaneous analysis of multiple rare variants (Derkach et al., 2014). Finally, the proposed method assumes a random sample of unrelated individuals; power and sample size computation for related individuals is worthy future work.
The proposed method, and the default setting of the implemented software SPCompute, assumes that all parameters are specified based on a prospective model, as in the UKB application. However, it can also be applied to data collected through a case-control design by modifying the disease prevalence parameter to reflect the case-control ratio used, as shown in Demidenko (2007). Additionally, our extensive simulation studies in Section 4 demonstrated that the proposed method is accurate when parameters were specified based on a prospective model while the data were collected through a case-control design. However, it should be noted that, unlike the regression parameters βG and βE, the covariate space FX can be very different once conditioned on the case-control ratio (Prentice and Pyke, 1979; Self and Mauritsen, 1988).
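The prevalence-to-case-fraction substitution can be operationalized by solving for the model intercept that matches the sampled case fraction. The snippet below is a generic bisection sketch under assumed parameter values; it is not SPCompute's actual interface.

```python
import numpy as np

rng = np.random.default_rng(5)

def solve_intercept(prevalence, beta_g, beta_e, maf=0.3, B=200_000):
    """Find beta_0 so that P(Y=1) matches the target 'prevalence' under
    logit P(Y=1) = beta_0 + beta_g*G + beta_e*E, with G ~ Binomial(2, maf)
    (HWE) independent of E ~ N(0, 1), by bisection over simulated (G, E).
    For a case-control design, set 'prevalence' to the sampled case
    fraction (e.g. 0.5 for 1:1 matching), following Demidenko (2007)."""
    g = rng.binomial(2, maf, B).astype(float)
    e = rng.normal(0.0, 1.0, B)
    lin = beta_g * g + beta_e * e
    lo, hi = -20.0, 20.0
    for _ in range(60):  # bisection: P(Y=1) is increasing in beta_0
        mid = (lo + hi) / 2.0
        if np.mean(1.0 / (1.0 + np.exp(-(mid + lin)))) < prevalence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Assumed effect sizes; for a 1:1 case-control sample use prevalence 0.5
b0 = solve_intercept(0.5, np.log(1.5), 0.5)
```

Rarer outcomes require a smaller intercept, so the solved beta_0 decreases monotonically as the target prevalence decreases.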
Our UKB application in Section 5 serves only as a proof of principle and highlights the practical utility of SPCompute, such as its ability to handle both binary and continuous covariates. We made some simplifying assumptions to make the example easier to understand. For example, we accounted for the covariate effects of age and sex simultaneously by introducing two separate models in the second-stage regression of (5). This implicitly assumed that the two covariates are conditionally independent given the SNP G, an assumption that might be unrealistic in more complex settings. Finally, the framework of the proposed method can be generalized to incorporate gene-gene and gene-environment interaction effects, which we will provide as future software updates.

Figure 1: Simulation results for the three scenarios considered in Section 4.1. Scenario 1 (S1) is the retrospective case-control sampling design without E. Scenario 2 (S2) and Scenario 3 (S3) are the prospective sampling design with, respectively, a binary and a continuous covariate E. Figures (a), (c) and (e) on the left panel compare the power computation when βG is fixed at log(1.5) (i.e. OR of 1.5, for (a-c)) or log(1.3) (for (e)), and Figures (b), (d) and (f) on the right panel compare the sample size computation to achieve a power of 80% at the significance level of 0.05 across different exp(βG). The red curves are for the 'semi-simulation' method (P1.SS in Section 3.2), blue curves for the 'representative dataset' method (P2.RD in Section 3.3), purple curves for Quanto of Gauderman (2002b), and green and pink curves for the method of Demidenko (2007); in S3, the method of Demidenko (2007) was implemented by dichotomizing E or without considering E. The black curves represent the oracle power and replication sample size estimated empirically.

Figure 2: Figure (a) displays the run time (in seconds per computation) for each method across different sample sizes, using simulation Scenario 2; results for the other two scenarios are characteristically similar. Proposed 1 is the 'semi-simulation' method (P1.SS) in Section 3.2, and Proposed 2 is the 'representative dataset' method (P2.RD) in Section 3.3. Figure (b) shows that there is a linear relationship between the log10 standard error (SE) of the estimated power and the log10 number of replicates (B) used for the proposed 'semi-simulation' method.

Figure 4: GWAS Manhattan plots for (a) the binary hypertension trait and (b) the continuous diastolic blood pressure trait, using the African sample (n = 2,510) identified through a PCA analysis of the self-identified African sample of the UK Biobank data, as discussed in Section 5. The association analyses included age and sex as important covariates. No SNPs reached the genome-wide significance level of 5e-8.

Table 1: Maximum absolute error (AE), across different sample sizes, between the oracle and computed power using the different methods for the three scenarios considered. P1.SS is the proposed 'semi-simulation' method in Section 3.2, and P2.RD is the proposed 'representative dataset' method in Section 3.3. In Scenario 3, the method of Demidenko (2007) was implemented by dichotomizing E or without considering E (results*). See the legend of Figure 1 for additional details.