Lack of Identification in Semiparametric Instrumental Variable Models With Binary Outcomes

A parameter in a statistical model is identified if its value can be uniquely determined from the distribution of the observable data. We consider the context of an instrumental variable analysis with a binary outcome for estimating a causal risk ratio. The semiparametric generalized method of moments and structural mean model frameworks use estimating equations for parameter estimation. In this paper, we demonstrate that lack of identification can occur in either of these frameworks, especially if the instrument is weak. In particular, the estimating equations may have no solution or multiple solutions. We investigate the relationship between the strength of the instrument and the proportion of simulated data sets for which there is a unique solution of the estimating equations. We see that this proportion does not appear to depend greatly on the sample size, particularly for weak instruments (ρ² ≤ 0.01). Poor identification was observed in a considerable proportion of simulated data sets for instruments explaining up to 10% of the variance in the exposure with sample sizes up to 1 million. In an applied example considering the causal effect of body mass index (weight (kg)/height (m)²) on the probability of early menarche, estimates and standard errors from an automated optimization routine were misleading.

With stronger IVs, mean and median estimates across simulations were close to the true parameter value.
As there was no confounding in the data-generating model, systematic weak instrument bias is not expected. However, this limited simulation study should not be interpreted as providing evidence for the behavior of the two-stage method with weak instruments. A more detailed investigation with a more realistic data-generating model incorporating confounding would be needed to make such claims. Such an investigation is beyond the scope of this paper. Bias of the two-stage method with a continuous outcome has been discussed previously [1,2,3], as has the bias with a binary outcome and a logistic model of association [3,4,5]; similar results are expected with a binary outcome and a log-linear model of association.

Web Table 1: Mean and median estimates of β₁ = 0.2 and standard deviation (SD) of estimates across simulations from two-stage method with different strengths of instrument as measured by the squared correlation between the instrument and exposure (ρ²) and the mean F statistic, and with different sample sizes.

Results are displayed in Web Table 2. This provides another datapoint supporting the hypothesis that the large sample behavior of the MGMM and LGMM estimators depends mostly on the value of ρ². The probability of obtaining a unique solution using the MGMM method with a sample size of 1 000 000 is less than that for a sample size of 5 000 when ρ² = 0.005.

Web Table 2: Percentage of simulated datasets with no solution, one solution (identified), and multiple solutions (lack of identification) from multiplicative and linear generalized method of moments methods with different strengths of instrument as measured by the squared correlation between the instrument and exposure (ρ²) and the mean F statistic, and with large sample size of 1 000 000 individuals. Abbreviations: LGMM, linear generalized method of moments; MGMM, multiplicative generalized method of moments.

Web Appendix 3
Absence of a solution to the estimating equations

If the estimating functions are expressed as a vector g(β), then the usual objective function is:

$$Q(\beta) = g(\beta)^{\mathsf{T}} \, W \, g(\beta) \tag{A.1}$$

where W is a weighting matrix, assumed to be of full rank (and, as is standard for a weighting matrix, positive definite). This is a quadratic form in the estimating functions, and so, viewed as a function of g, has a unique minimum at g = 0. The choice of W affects the efficiency of estimates, but not the consistency [6]. The objective function will be equal to zero if and only if the estimating functions are all zero, hence the simulation study to investigate identification performed in this paper is agnostic to the choice of weighting matrix.
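The claim that the objective function is zero exactly when the estimating functions are zero, whatever the weighting matrix, can be checked numerically. The following is a minimal sketch rather than the code used in the paper: the function name `gmm_objective`, the two weighting matrices, and the example moment vectors are all hypothetical, and W is assumed to be symmetric positive definite.

```python
import numpy as np

def gmm_objective(g, W):
    """Quadratic-form objective g' W g for a vector of sample moments g."""
    g = np.asarray(g, dtype=float)
    return float(g @ W @ g)

# Two hypothetical weighting matrices, both symmetric positive definite.
W_identity = np.eye(2)
W_other = np.array([[2.0, 0.5],
                    [0.5, 1.0]])

g_zero = np.zeros(2)              # estimating functions exactly solved
g_nonzero = np.array([0.1, -0.3])  # estimating functions not solved

for W in (W_identity, W_other):
    # Zero moments give a zero objective under either weighting matrix;
    # nonzero moments give a strictly positive objective under either.
    print(gmm_objective(g_zero, W), gmm_objective(g_nonzero, W))
```

The value of the objective at a nonzero g differs between the two matrices, which is why W matters for efficiency, but whether the objective reaches zero does not depend on W.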
In any case, with a single IV, the choice of weighting matrix is moot, as the first moment condition in the MGMM or LGMM methods can always be equated to zero by altering the intercept term β₀. Efficient estimates for MGMM with multiple IVs have been discussed previously [7], but are beyond the scope of this paper.
The minimizer of the objective function may be a valid estimate (that is, consistent for the target parameter of interest) even when it does not solve the estimating equations, if the values of the estimating equations at the estimate tend to zero as the sample size tends to infinity [8]. However, there is no guarantee that the solution will be unique.
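With a single IV, the observation that the intercept can always zero the first moment condition suggests a simple way to look for no, one, or multiple solutions: profile out β₀ and scan the remaining moment condition over a grid of β₁ values. The sketch below is illustrative only. It assumes the multiplicative moment conditions take the form E[Y exp(−β₀ − β₁X) − 1] = 0 and E[Z(Y exp(−β₀ − β₁X) − 1)] = 0, which is not stated in this appendix, and the data-generating values other than β₁ = 0.2 (instrument strength, intercept, sample size, grid range) are hypothetical.

```python
import numpy as np

def profiled_moment(beta1, x, y, z):
    """Second MGMM moment after choosing beta0 to zero the first moment.

    Assumed moments: mean(y*exp(-b0 - b1*x) - 1) and mean(z*(y*exp(-b0 - b1*x) - 1)).
    Setting the first to zero gives exp(b0) = mean(y*exp(-b1*x)); substituting
    into the second leaves a weighted mean of z minus the plain mean of z.
    """
    w = y * np.exp(-beta1 * x)
    return np.sum(z * w) / np.sum(w) - np.mean(z)

def count_sign_changes(values):
    """Number of sign changes along a grid (a lower bound on the root count)."""
    signs = np.sign(values)
    signs = signs[signs != 0]
    return int(np.sum(signs[1:] != signs[:-1]))

# Hypothetical simulated data: weak instrument, no confounding, beta1 = 0.2.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
x = 0.1 * z + rng.normal(size=n)
y = rng.binomial(1, np.clip(np.exp(-2.0 + 0.2 * x), 0.0, 1.0))

grid = np.linspace(-3.0, 3.0, 601)
values = np.array([profiled_moment(b, x, y, z) for b in grid])
n_roots = count_sign_changes(values)
print("sign changes of the profiled moment on the grid:", n_roots)
```

Zero sign changes on the grid corresponds to no solution in that range, one to a unique solution, and more than one to multiple solutions; the count depends on the grid range and resolution, so it should be read as a diagnostic rather than a proof.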

Web Appendix 4 Supplementary methods for applied example
Here we provide additional details of how the applied analyses in the main paper were undertaken. GMM/SMM analyses were performed using the gmm command in Stata 12 [9]; this command minimizes an objective function similar to equation (A.1). The weighting matrix used is derived from a two-step estimation procedure, in which the first step uses the identity weighting matrix to obtain a parameter estimate which is used to construct the second-step weighting matrix [10]. Starting values of (0, 0) were taken and the exposure was centered throughout (this means that the reference value of the exposure where X = 0 was 16.4 kg/m²). A two-stage method was undertaken, using linear regression in the first stage and log-linear (Poisson) regression in the second stage [11]. In the two-stage method, robust standard errors to account for uncertainty in the first-stage regression were not used, as their use decreased standard errors, rather than increasing them. We also assessed the causal effect of the exposure on the outcome by testing for an association between the IV and the outcome using log-linear regression. The assessment of a causal effect requires the weakest assumptions of any of the methods as the test requires no assumption on the distribution of the exposure. The stronger distributional assumptions of the two-stage method were satisfied in the simulation study, but cannot be fully tested in the applied example. We examined the distribution of the exposure at various values of the IV; it was close to normal.
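The two-stage method described above (linear regression of the exposure on the IV, then log-linear Poisson regression of the outcome on the fitted exposure) can be sketched in a few lines. This is not the Stata code used for the applied analysis: the function names are hypothetical, the Poisson fit uses a plain Newton iteration started at zero, standard errors are not computed, and the simulated data (instrument strength 0.5, intercept −2, β₁ = 0.2, n = 20 000) are illustrative assumptions only.

```python
import numpy as np

def ols_fit(design, response):
    """Least-squares coefficients for a linear regression."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef

def poisson_loglinear_fit(design, response, n_iter=50, tol=1e-10):
    """Log-linear (Poisson) regression via Newton iterations from zero."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        mu = np.exp(design @ beta)
        score = design.T @ (response - mu)        # gradient of log-likelihood
        info = design.T @ (design * mu[:, None])  # Fisher information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def two_stage_log_linear(x, y, z):
    """Stage 1: regress x on z. Stage 2: Poisson-regress y on fitted x."""
    ones = np.ones(len(x))
    first_design = np.column_stack([ones, z])
    x_hat = first_design @ ols_fit(first_design, x)
    second_design = np.column_stack([ones, x_hat])
    return poisson_loglinear_fit(second_design, y)

# Hypothetical simulated data with no confounding.
rng = np.random.default_rng(1)
n = 20000
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = rng.binomial(1, np.clip(np.exp(-2.0 + 0.2 * x), 0.0, 1.0))

beta0_hat, beta1_hat = two_stage_log_linear(x, y, z)
print("two-stage estimate of the log causal risk ratio:", beta1_hat)
```

In the applied analysis the exposure was centered before fitting; in this sketch the simulated exposure already has mean close to zero, so no centering step is shown.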