Power calculator for instrumental variable analysis in pharmacoepidemiology

Abstract

Background: Instrumental variable analysis, for example with physicians’ prescribing preferences as an instrument for medications issued in primary care, is an increasingly popular method in the field of pharmacoepidemiology. Existing power calculators for studies using instrumental variable analysis, such as Mendelian randomization power calculators, do not allow for the structure of research questions in this field. This is because the analysis in pharmacoepidemiology will typically have stronger instruments and detect larger causal effects than in other fields. Consequently, there is a need for dedicated power calculators for pharmacoepidemiological research.

Methods and Results: The formula for calculating the power of a study using instrumental variable analysis in the context of pharmacoepidemiology is derived before being validated by a simulation study. The formula is applicable for studies using a single binary instrument to analyse the causal effect of a binary exposure on a continuous outcome. An online calculator, as well as packages in both R and Stata, are provided for the implementation of the formula by others.

Conclusions: The statistical power of instrumental variable analysis in pharmacoepidemiological studies to detect a clinically meaningful treatment effect is an important consideration. Research questions in this field have distinct structures that must be accounted for when calculating power. The formula presented differs from existing instrumental variable power formulae due to its parametrization, which is designed specifically for ease of use by pharmacoepidemiologists.


Introduction
Pharmacoepidemiological studies risk irrelevance if they are insufficiently powered to detect clinically meaningful treatment effects. Before starting a study, the statistical power to detect a given treatment effect can be calculated. This type of calculation is becoming increasingly important for grant and data request applications, which look to value the contribution of such studies.
The number of pharmacoepidemiology studies using instrumental variable analysis, for example with physicians' prescribing preferences as an instrument for exposure, continues to grow. [1][2][3][4][5][6] This is partly because instrumental variable analyses have the potential to overcome some of the issues associated with conventional statistical approaches, such as residual confounding and reverse causation. As the demand to provide power calculations to support applications increases, there is a more pressing need to be able to provide power calculations for this method.
There are power calculators for instrumental variable analysis in other settings, such as Mendelian randomization, which uses germline genetic variants as proxies for exposures in disease-related research. 7,8 However, pharmacoepidemiological research questions have distinct structures that are not sufficiently catered for by these existing calculators. Unlike Mendelian randomization studies, which often use a case-control study design, pharmacoepidemiology studies typically use a cohort study design. Further to this, pharmacoepidemiology studies usually report a risk difference for a binary exposure using a binary instrument, whereas Mendelian randomization studies report on a continuous exposure using a discrete or continuous genetic instrument (count of alleles or allele score respectively). As a result of these differences, as well as the stronger instruments and larger causal effects seen in pharmacoepidemiology, there is a need for a dedicated power calculator for instrumental variable analysis in the context of this field. This paper will address how to conduct power calculations for pharmacoepidemiological studies using a single binary instrument to analyse the causal effect of a binary exposure on a continuous outcome. The formula to calculate power will be derived and then validated by a simulation study. The formula is distinct from existing instrumental variable power formulae due to its parametrization, which is designed specifically for ease of use by pharmacoepidemiologists. An online calculator, as well as packages in both R and Stata, are provided for the implementation of the formula by others.

Methods and Results
Let us consider physicians' prescribing preferences for two different treatments (for example, a treatment of interest and a control treatment) as an instrument for exposure to these treatments. Physicians' preferences are generally not directly observable, so each physician's prescriptions to previous patients are used as a proxy for their preferences. This results in a binary instrument that takes a value of one if the physician issued a prescription for the treatment of interest to their previous patient and a value of zero if they prescribed the control treatment. We will derive the formula for the power of studies that use this instrument to measure the causal effect of a drug exposure on a continuous outcome, for example systolic blood pressure or low-density lipoprotein cholesterol.

Formula derivation
The instrumental variable analysis we consider requires three variables: a binary instrument $Z$, a binary exposure $X$ and a continuous outcome $Y$. The outcome for patient $i$, for $i = 1, \ldots, n$, is modelled as follows:

$$Y_i = \alpha + \beta X_i + U_i$$

where $\alpha$ is an intercept, $\beta$ is the causal effect of the exposure on the outcome, and $U_i$ is a zero-mean error term containing unobserved confounders, determining both the outcome $Y_i$ and the treatment $X_i$. The instrument $Z_i$ affects treatment $X_i$, but is not associated with the unobserved confounders and has no direct effect on the outcome.

Key Messages
• Research questions using instrumental variable analysis in pharmacoepidemiology have distinct structures that have previously not been catered for by instrumental variable analysis power calculators.
• Power can be calculated for studies using a single binary instrument to analyse the causal effect of a binary exposure on a continuous outcome in the context of pharmacoepidemiology using the presented formula, an online power calculator or packages available for use in both R and Stata.
• The use of this power calculator will allow investigators to determine whether a pharmacoepidemiology study is likely to detect clinically meaningful treatment effects before the study's commencement.
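Let $\tilde{Y}_i = Y_i - \bar{Y}$, $\tilde{X}_i = X_i - \bar{X}$ and $\tilde{Z}_i = Z_i - \bar{Z}$, where $\bar{Y}$, $\bar{X}$ and $\bar{Z}$ are sample averages. Denote by $\tilde{y}$, $\tilde{x}$ and $\tilde{z}$ the $n$-vectors of observations on $\tilde{Y}_i$, $\tilde{X}_i$ and $\tilde{Z}_i$, respectively. The two-stage least squares (2SLS) estimator of $\beta$ is then given by

$$\hat{\beta} = (\tilde{x}' P_{\tilde{z}} \tilde{x})^{-1} \tilde{x}' P_{\tilde{z}} \tilde{y}$$

The variance of the 2SLS estimator is:

$$\mathrm{Var}(\hat{\beta}) = \sigma^2 (\tilde{x}' P_{\tilde{z}} \tilde{x})^{-1}$$

where $P_{\tilde{z}} = \tilde{z} (\tilde{z}' \tilde{z})^{-1} \tilde{z}'$ and $\sigma^2 = E(U_i^2)$ is the residual variance. Note that conditional homoscedasticity holds, so the variance is constant for all values of the instrument, i.e. $E(U_i^2 \mid Z_i) = \sigma^2$.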
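Consider the term $\tilde{x}' P_{\tilde{z}} \tilde{x}$:

$$\tilde{x}' P_{\tilde{z}} \tilde{x} = \tilde{x}' \tilde{z} (\tilde{z}' \tilde{z})^{-1} \tilde{z}' \tilde{x} = \frac{(\tilde{z}' \tilde{x})^2}{\tilde{z}' \tilde{z}}$$

Because $Z$ and $X$ are binary, $\tilde{z}' \tilde{z} = n p_Z (1 - p_Z)$ and $\tilde{z}' \tilde{x} = n p_Z (1 - p_Z)(p_{X|Z=1} - p_{X|Z=0})$, where $p_Z = P(Z = 1)$ and $p_{X|Z=j} = P(X = 1 \mid Z = j)$ are evaluated as sample proportions. Hence $\tilde{x}' P_{\tilde{z}} \tilde{x}$ can be presented in the following way:

$$\tilde{x}' P_{\tilde{z}} \tilde{x} = n p_Z (1 - p_Z)(p_{X|Z=1} - p_{X|Z=0})^2$$

Now consider the instrumental variable estimator of $\beta$.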
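The simplification of the projection term can be checked numerically. The following is a minimal sketch, not part of the published packages: the function names `projection_term` and `closed_form` are hypothetical, and the check relies only on the instrument and exposure being coded 0/1.

```python
import random

def projection_term(x, z):
    """Compute x~' P_z~ x~ = (z~'x~)^2 / (z~'z~) from raw 0/1 data."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    zx = sum((zi - z_bar) * (xi - x_bar) for xi, zi in zip(x, z))
    zz = sum((zi - z_bar) ** 2 for zi in z)
    return zx ** 2 / zz

def closed_form(x, z):
    """Compute n * p_Z * (1 - p_Z) * (p_X|Z=1 - p_X|Z=0)^2 from sample proportions."""
    n = len(x)
    n1 = sum(z)  # number of patients with Z = 1
    p_z = n1 / n
    p_x1 = sum(xi for xi, zi in zip(x, z) if zi == 1) / n1
    p_x0 = sum(xi for xi, zi in zip(x, z) if zi == 0) / (n - n1)
    return n * p_z * (1 - p_z) * (p_x1 - p_x0) ** 2

# Illustrative data only: the probabilities below are arbitrary choices.
rng = random.Random(1)
z = [1 if rng.random() < 0.2 else 0 for _ in range(5000)]
x = [1 if rng.random() < (0.45 if zi else 0.25) else 0 for zi in z]
print(projection_term(x, z), closed_form(x, z))
```

The two printed values should agree up to floating-point rounding, because the identity holds exactly in any sample where both instrument groups are non-empty.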
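Using the asymptotic distribution $\hat{\beta} \sim N(\beta, \sigma^2 (\tilde{x}' P_{\tilde{z}} \tilde{x})^{-1})$, the distribution of the t-test statistic under the null hypothesis $H_0: \beta = 0$ is

$$t = \frac{\hat{\beta}}{\sqrt{\sigma^2 (\tilde{x}' P_{\tilde{z}} \tilde{x})^{-1}}} \sim N(0, 1)$$

The distribution of the test statistic under the alternative hypothesis $H_1: \beta = \delta$ is

$$t \sim N\left(\frac{\delta \sqrt{\tilde{x}' P_{\tilde{z}} \tilde{x}}}{\sigma}, 1\right)$$

The null hypothesis is rejected if $|t| > c_\alpha$, where $c_\alpha$ is the critical value at significance level $\alpha$.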
The power is the probability that the test statistic will exceed the critical value, which is:

$$\text{Power} = P(|t| > c_\alpha \mid \beta = \delta) = 1 + \Phi\left(-c_\alpha - \frac{\delta \sqrt{\tilde{x}' P_{\tilde{z}} \tilde{x}}}{\sigma}\right) - \Phi\left(c_\alpha - \frac{\delta \sqrt{\tilde{x}' P_{\tilde{z}} \tilde{x}}}{\sigma}\right)$$

where $\Phi(s)$ is the cumulative standard normal distribution function evaluated at $s$. Power therefore increases as the value of $\sigma$ decreases and/or the value of $\tilde{x}' P_{\tilde{z}} \tilde{x}$ increases. By substituting $\tilde{x}' P_{\tilde{z}} \tilde{x}$ and simplifying, we obtain the following formula for power:

$$\text{Power} = 1 + \Phi\left(-c_\alpha - \frac{\delta (p_{X|Z=1} - p_{X|Z=0}) \sqrt{n p_Z (1 - p_Z)}}{\sigma}\right) - \Phi\left(c_\alpha - \frac{\delta (p_{X|Z=1} - p_{X|Z=0}) \sqrt{n p_Z (1 - p_Z)}}{\sigma}\right)$$

The formula requires a total of seven parameters to be specified. Four must always be specified: the significance level, $\alpha$; the size of the causal effect, $\delta$; the residual variance, $\sigma^2 = E(U_i^2)$; and the sample size, $n$. The remaining three can be chosen from the following four: the frequency of the instrument, $p_Z = P(Z = 1)$; the frequency of exposure, $p_X = P(X = 1)$; the probability of exposure given the instrument $Z = 1$, $p_{X|Z=1} = P(X = 1 \mid Z = 1)$; and the probability of exposure given the instrument $Z = 0$, $p_{X|Z=0} = P(X = 1 \mid Z = 0)$. The chosen parameters must be specified so that the following holds:

$$p_X = p_{X|Z=1} p_Z + p_{X|Z=0} (1 - p_Z)$$

The formula for power is available for use via an online calculator [https://venexia.shinyapps.io/PharmIV/] and packages for R and Stata can be downloaded from GitHub [https://github.com/venexia/PharmIV].
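To make the calculation concrete, the power formula can be sketched in a few lines of Python. This is an illustrative sketch rather than the code of the R or Stata packages: the function name `iv_power` and its argument names are hypothetical, and the critical value is assumed to be the two-sided normal quantile $c_\alpha = \Phi^{-1}(1 - \alpha/2)$.

```python
from math import sqrt
from statistics import NormalDist

def iv_power(alpha, delta, sigma2, n, p_z, p_x1, p_x0):
    """Asymptotic power for a binary instrument, binary exposure and continuous outcome.

    p_x1 and p_x0 are P(X = 1 | Z = 1) and P(X = 1 | Z = 0).
    """
    std_normal = NormalDist()
    # Assumption: two-sided test, so the critical value is the 1 - alpha/2 quantile.
    c_alpha = std_normal.inv_cdf(1 - alpha / 2)
    # Mean of the test statistic under the alternative hypothesis beta = delta.
    shift = delta * (p_x1 - p_x0) * sqrt(n * p_z * (1 - p_z)) / sqrt(sigma2)
    return 1 + std_normal.cdf(-c_alpha - shift) - std_normal.cdf(c_alpha - shift)

# Example: p_X = 0.25 with p_Z = 0.2 and P(X = 1 | Z = 1) = 0.45 implies
# P(X = 1 | Z = 0) = (0.25 - 0.45 * 0.2) / 0.8 = 0.2.
print(round(iv_power(0.05, -0.150, 1.0, 10_000, 0.2, 0.45, 0.2), 3))
```

Because the formula is symmetric in the sign of the effect, `iv_power` returns the same value for $\delta$ and $-\delta$, and it returns $\alpha$ when $\delta = 0$.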
Note that the frequency of exposure in an instrumental variable analysis of this type is likely to be higher than in a general population study because a drug is compared against one or more other drugs in a population of people with the indication for these treatments. General population studies, on the other hand, tend to compare a population who received the drug of interest with a population who did not receive it, and consequently the frequency of exposure is generally much lower. The effect of varying the parameters within the formula on a study's power is best presented graphically. Figure 1 illustrates an example of the effect of the frequency of the exposure $p_X = P(X = 1)$ on the power of a study to detect a causal effect of $\delta = -0.150$ using an instrument with a frequency of $p_Z = 0.200$, a residual variance of $\sigma^2 = 1$ and a sample size of up to 30 000 participants. Both increasing the frequency of exposure up to 50% and increasing the sample size result in increased power for this study.

Formula validation
To validate the power formula, we conducted a simulation. We simulated the data by defining the three variables necessary to conduct instrumental variable analysis with a single instrumental variable as follows:

$$Z_i \sim \mathrm{Binomial}(1, p_Z)$$

$$X_i = 1\{\Phi^{-1}(P(X = 1 \mid Z = Z_i)) + V_i > 0\}$$

$$Y_i = \delta X_i + U_i$$

where $\Phi^{-1}(P(X = 1 \mid Z = j))$ for $j = 0, 1$ are the inverse cumulative standard normal distribution, or quantile, functions of the conditional probabilities of exposure given the instrument, $\delta$ is the causal effect, and $U_i$ and $V_i$ are standard normally distributed error terms with covariance $\rho$.
The formula uses a binary instrument, binary exposure and continuous outcome and so the above variables were simulated to recreate data of this form. The instrument $Z$ is modelled by a binomial distribution parameterized by its frequency $p_Z = P(Z = 1)$. This ensures a binary variable with the correct probability of success. The exposure $X$ is also binary but is modelled using a threshold model. The variability in the equation for the exposure comes from the normally distributed error term $V_i$. The use of the model equation allows the exposure $X$ to be associated with the instrument $Z$. The outcome $Y$ is modelled by its model equation $Y_i = \delta X_i + U_i$. In the model, the instrument is valid as the outcome $Y$ is only associated with the exposure $X$, as dictated by the causal effect $\delta$, and is not associated with the instrument $Z$ other than through the exposure $X$.
Using the generated data, we performed an instrumental variable analysis using the command ivreg2 in Stata. 9 From this analysis, we recorded the coefficient of the exposure $X$ with the 95% confidence interval. We then counted the number of simulations for which the confidence interval excluded the null, and divided this by the total number of simulations to determine the power. By running the simulation and calculating the formula using the same parameters, we are able to validate the formula against the simulation.
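The simulation procedure can be sketched in Python as follows. This is not the Stata code used for the paper: the function name `simulate_power` is hypothetical, the exposure is assumed to follow a probit-type threshold model in which $P(X = 1 \mid Z = j) = \Phi(\text{threshold}_j)$, and the standard error uses the conventional large-sample 2SLS variance with divisor $n$, which may differ in detail from the options used with ivreg2.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_power(n, reps, delta, p_z, p_x1, p_x0, rho=0.0, alpha=0.05, seed=0):
    """Share of simulated datasets in which the IV test rejects beta = 0."""
    rng = random.Random(seed)
    normal = NormalDist()
    c_alpha = normal.inv_cdf(1 - alpha / 2)
    # Assumed threshold model: X = 1 when threshold[Z] + V > 0.
    threshold = (normal.inv_cdf(p_x0), normal.inv_cdf(p_x1))
    rejections = 0
    for _ in range(reps):
        sz = sx = sy = szx = szy = sxy = syy = 0.0
        for _ in range(n):
            z = 1 if rng.random() < p_z else 0
            v = rng.gauss(0.0, 1.0)
            # U is standard normal with correlation rho with V.
            u = rho * v + sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
            x = 1 if threshold[z] + v > 0 else 0
            y = delta * x + u
            sz += z
            sx += x
            sy += y
            szx += z * x
            szy += z * y
            sxy += x * y
            syy += y * y
        # Centred sums of squares and cross-products (z and x are 0/1).
        c_zz = sz - sz * sz / n
        c_xx = sx - sx * sx / n
        c_zx = szx - sz * sx / n
        c_zy = szy - sz * sy / n
        c_xy = sxy - sx * sy / n
        c_yy = syy - sy * sy / n
        beta_hat = c_zy / c_zx  # IV estimate with a single instrument
        sigma2_hat = (c_yy - 2 * beta_hat * c_xy + beta_hat ** 2 * c_xx) / n
        se = sqrt(sigma2_hat * c_zz) / abs(c_zx)
        if abs(beta_hat / se) > c_alpha:
            rejections += 1
    return rejections / reps

# Small illustrative run; the paper used 10 000 repetitions per scenario.
print(simulate_power(n=1000, reps=200, delta=-0.7, p_z=0.2, p_x1=0.45, p_x0=0.2))
```

Rejecting when the absolute test statistic exceeds the critical value is equivalent to the confidence interval excluding the null. With only a few hundred repetitions the estimate carries noticeable Monte Carlo error, so it should be compared with the formula only approximately.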
We present the power calculated from both the simulation and the formula for several parameter combinations in Table 1. The table contains 27 different simulations and each was repeated 10 000 times. The simulations consider each combination of three values of the frequency of exposure, $p_X = 0.100, 0.250, 0.500$; three values of the probability of exposure given the instrument $Z = 1$, $p_{X|Z=1} = 0.150, 0.300, 0.450$; and three values of the sample size, $n = 10\,000, 20\,000, 30\,000$. We set the frequency of the instrument, $p_Z = 0.200$; the causal effect, $\delta = -0.150$; the residual variance, $\sigma^2 = 1$; and calculated $P(X = 1 \mid Z = 0)$ according to the following equation:

$$P(X = 1 \mid Z = 0) = \frac{p_X - p_{X|Z=1} p_Z}{1 - p_Z}$$

The effect of confounding was removed as a parameter because the power was insensitive to its value in the simulation setting. Details of the simulations conducted to test this can be found in Supplementary File 1, available as Supplementary data at IJE online. The Stata code for this paper, including that used to create the simulation, is available from GitHub [https://github.com/venexia/PharmIV].
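As an illustration of how the 27 scenarios arise, the parameter grid and the implied values of $P(X = 1 \mid Z = 0)$ can be generated as below. This is a sketch under the stated parameter values; the helper name `p_x_given_z0` is hypothetical and is not taken from the published code.

```python
from itertools import product

P_Z = 0.200  # frequency of the instrument, fixed across scenarios

def p_x_given_z0(p_x, p_x1, p_z):
    """Solve p_X = p_X|Z=1 * p_Z + p_X|Z=0 * (1 - p_Z) for p_X|Z=0."""
    p_x0 = (p_x - p_x1 * p_z) / (1 - p_z)
    if not 0 <= p_x0 <= 1:
        raise ValueError("parameters do not define a valid probability")
    return p_x0

# Every combination of exposure frequency, P(X = 1 | Z = 1) and sample size.
grid = [
    (p_x, p_x1, n, p_x_given_z0(p_x, p_x1, P_Z))
    for p_x, p_x1, n in product(
        (0.100, 0.250, 0.500),
        (0.150, 0.300, 0.450),
        (10_000, 20_000, 30_000),
    )
]

for p_x, p_x1, n, p_x0 in grid:
    print(f"p_X={p_x:.3f}  p_X|Z=1={p_x1:.3f}  n={n}  p_X|Z=0={p_x0:.4f}")
```

The validity check matters in general because not every choice of $p_X$, $p_Z$ and $P(X = 1 \mid Z = 1)$ yields a probability between zero and one; all 27 combinations used here do.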

Simulation results
The formula and the simulation consistently provide similar results, with an absolute mean difference of 0.4% for the parameter combinations presented in Table 1. There is also no discernible pattern in the differences, suggesting systematic bias is not present. Further to this, the power is consistent with its behaviour in other established power calculations. For example, increasing sample size universally improves power for all parameter combinations.

Discussion
In this paper, we have derived the formula necessary to calculate power for instrumental variable analysis with a single binary instrument, binary exposure and continuous outcome in the context of pharmacoepidemiology. The formula has been shown to be valid by comparison against a simulation study, which concluded that the formula provided near true values across a range of realistic parameters. We acknowledge that there is some overlap of this calculator with existing calculators such as that proposed by Burgess for Mendelian randomization. 7 Although both calculators ultimately have a shared aim, namely to calculate the power of a study using an instrumental variable analysis, their application makes the calculators distinct. This is evident from the choice of parameterization of the power formula. The Mendelian randomization calculator opts for the coefficient of determination, $R^2$. This is a natural choice for the application, as it summarizes the proportion of the variance expected to be explained by genetic factors. In contrast, we have opted to parameterize our calculator for use in pharmacoepidemiological studies in terms of the frequency of the instrument, the frequency of the exposure and the conditional probabilities that relate them. This is more intuitive to a pharmacoepidemiological audience who will typically use the proportion of patients exposed, i.e. the frequency of exposure. In addition to this, the instruments typically used in this framework (for example, physicians' prescribing preference) do not necessarily fit as naturally to the notion of variance explained and are summarized much more easily by their frequency and their relationship with exposure.
A concern for any instrumental variable analysis, whether in the context of pharmacoepidemiology or not, is the strength of the instrument. Instruments are termed weak when the correlation between the instruments and the exposure is low. 10 A commonly cited threshold is a partial F statistic of the association between the instrument and the exposure of less than 10. 8,11 Weak instruments will result in low power to detect a causal effect. [12][13][14] They are also known to induce bias, as such instruments may explain only a small proportion of the association between the exposure and outcome. Therefore, although pharmacoepidemiological studies are likely to have stronger instruments than other forms of instrumental variable analysis such as Mendelian randomization, researchers should remain mindful of their choice of instrument and whether it is appropriate for the research question they wish to study.
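For planning purposes, the strength of a binary instrument can be gauged from the same parameters used in the power formula. The sketch below is an approximation and is not part of the published calculator: the function name `expected_first_stage_f` is hypothetical, and it assumes a first-stage linear regression of the exposure on a single instrument with an intercept, so that $F = (n - 2) R^2 / (1 - R^2)$ with $R^2$ evaluated at the population proportions.

```python
def expected_first_stage_f(n, p_z, p_x1, p_x0):
    """Approximate first-stage F statistic for a binary instrument and exposure."""
    # Marginal frequency of exposure implied by the conditional probabilities.
    p_x = p_x1 * p_z + p_x0 * (1 - p_z)
    # Squared correlation between the binary instrument and the binary exposure.
    r2 = p_z * (1 - p_z) * (p_x1 - p_x0) ** 2 / (p_x * (1 - p_x))
    return (n - 2) * r2 / (1 - r2)

# A 25 percentage point difference in prescribing gives a strong instrument...
print(expected_first_stage_f(10_000, 0.2, 0.45, 0.20))
# ...whereas a 1 percentage point difference is weak even with 10 000 patients.
print(expected_first_stage_f(10_000, 0.2, 0.21, 0.20))
```

Comparing the result with the threshold of 10 gives a quick indication of whether weak instrument bias is a concern for a planned study.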
As for any power formula, the formula presented here is limited by its parameters, which simplify the dataset being considered. Power calculated from such formulae cannot account for dataset characteristics outside these parameters. For example, the formula makes no allowance for the presence of missing data, a known limiting factor in the power of a study. By allowing for missing data in the anticipated sample size, conservative estimates for the power of a study can be obtained using the formula presented. Further work is needed in order to establish the formula for power in other scenarios that use instrumental variable analysis within a pharmacoepidemiology context. This includes analyses with binary outcomes and analyses that involve multiple instrumental variables.
As the use of instrumental variable analysis in pharmacoepidemiology becomes more commonplace, there is an increasing need to provide power calculations for studies using this type of analysis. To provide such information, accessible and accurate power formulae need to be made available.

Conflict of interest: None to declare.