## Abstract

We propose a method for testing gene–environment (G × E) interactions on a complex trait in family-based studies in which a phenotypic ascertainment criterion has been imposed. This novel approach employs G-estimation, a semiparametric estimation technique from the causal inference literature, to avoid modeling of the association between the environmental exposure and the phenotype, to gain robustness against unmeasured confounding due to population substructure, and to acknowledge the ascertainment conditions. The proposed test allows for incomplete parental genotypes. It is compared by simulation studies to an analogous conditional likelihood–based approach and to the QBAT-I test, which also invokes the G-estimation principle but ignores ascertainment. We apply our approach to a study of chronic obstructive pulmonary disorder.

## INTRODUCTION

It is well accepted that the interplay of genetic and environmental factors can form an important part of the biological foundation of common diseases. Some examples of gene–environment (G × E) interactions include the N-acetyltransferase 2 (*NAT2*) gene with smoking in bladder cancer (García-Closas *and others*, 2005) and with red meat consumption in colorectal cancer (Chan *and others*, 2005), the alcohol dehydrogenase type 3 (*ADH3*) gene with moderate alcohol consumption in myocardial infarction (Hines *and others*, 2001) and the serine proteinase inhibitor clade E, member 2 gene (*SERPINE2*) with smoking in chronic obstructive pulmonary disorder (COPD) (Silverman, 2006). Detecting, and more importantly, understanding these G×E interactions thus play a crucial role in solving the puzzle of gene-influenced phenotypic variation. Several study designs have been proposed for discovering these types of associations, and the designs can be broadly characterized as either population- or family-based depending on whether or not study subjects are related.

When considering designs consisting of related subjects, the class of family-based association tests (FBATs; Laird *and others*, 2000) provides strategies for testing genetic effects that are robust to undetected/unaccounted for population substructure. Conditioning on parental/founder genotypes (or the corresponding sufficient statistics when parental genotypes are missing; Rabinowitz and Laird, 2000) insulates these testing strategies from the bias due to ancestry-driven confounding. By relying on *a priori* knowledge regarding the distribution of the offspring genotypes conditional on the parental genotypes, these testing strategies further remain valid when ascertainment conditions are imposed for sample recruitment. This property is lost once a main genetic effect must be estimated, as in the case of testing for G × E or gene–gene (G × G) interactions. Current methodologies assume either (i) no ascertainment condition (i.e. a population sample) (Umbach and Weinberg, 2000), (Yang *and others*, 2000), (Vansteelandt, VanderWeele, *and others*, 2008), (ii) a dichotomous trait of interest (Yang *and others*, 1999), (Liu *and others*, 2002), (Cordell *and others*, 2004), (Lake and Laird, 2004), (Whittemore, 2004), (Chatterjee *and others*, 2005), (Hoffmann *and others*, 2009), (Moerkerke *and others*, 2010), (Wang *and others*, 2011) or (iii) a fully parametric framework (Whittemore, 2004), (Dudbridge, 2008). It is likely that extensions of some of the aforementioned parametric approaches could be powerful alternatives able to handle multiparameter settings when the parametric assumptions are correct. For example, the parental-genotype-robust estimator proposed by Whittemore (2004) could be modified to estimate G×E interaction effects under continuous trait ascertainment schema and, thus, employ a more standard, non-causal-inference–based method.

Our focus here will be on continuous traits. Considering that controlling false-positive rates is a foremost concern in this age of large-scale genetic association studies, we will attempt to avoid parametric assumptions where possible by making use of G-estimation (Robins *and others*, 1992), (Joffe and Brensinger, 2003), (Greenland *and others*, 2008). This is a semiparametric estimation method from the causal inference literature, which underlies the class of FBATs (Vansteelandt, Demeo, *and others*, 2008). In particular, adapting ideas on artificial censoring in G-estimation for structural nested failure time models (Robins and Tsiatis, 1991), (Robins, 2002), we propose a test that maintains robustness to population stratification and correctly accounts for sample ascertainment, while avoiding assumptions on the phenotypic and allele frequency distributions. We assess the performance of this test empirically through extensive simulation studies. To illustrate the approach in practice, we also apply these new techniques to a study of COPD (Demeo *and others*, 2006).

## METHODS

### Notation

We consider a family-based genetic association study and introduce the following notation: *X*_{ij} is a function of the genotype from the *j*th member of the *i*th family (*i* = 1,…,*n*, where *n* is the number of independent families; *j* = 1,…,*m*_{i}, where *m*_{i} is the number of nonfounders in the *i*th pedigree) coded to reflect some particular mode of inheritance (MOI) (e.g. for an assumed additive MOI, *X*_{ij} is the count of risk alleles from the *j*th member of the *i*th family); *𝒮*_{i} is the sufficient statistic for the founder genotypes of the *i*th family (see Rabinowitz and Laird, 2000,); *Y*_{ij} is the disease phenotype of interest; *Z*_{ij} is an environmental covariate; and *A*_{i} is a hypothetical ascertainment indicator which is 1 for families that have been ascertained and 0 otherwise.

### Model

Figure 1 graphically illustrates the causal assumptions for a typical family-based association study using a directed acyclic graph (DAG) (Pearl, 1995), (Pearl, 2000), (Robins, 2001). It encodes, with arrows, all possible direct causal paths between the variables considered in the study. Let us first assume that all parental genotypes have been observed so that *𝒮* encodes the founder genotypes. While we allow for measured and unmeasured covariates *Z* and *U*, respectively, to affect (or be associated with) the founder genotypes, *𝒮*, and the disease phenotype, *Y*, directly, the DAG assumes that they can affect the nonfounder genotype, *X*, only through *𝒮*. This particular assumption underlies the validity of the class of FBATs, even in the presence of population stratification (FBATs; Rabinowitz and Laird, 2000). Note furthermore that the DAG embodies the common G×E independence assumption *X*╨*Z*|*𝒮* (Umbach and Weinberg, 1997), which is often deemed plausible, but has nevertheless been found to be violated in specific studies (Chanock and Hunter, 2008) and should be critically assessed whenever applied.

Under these assumptions, estimating the overall genetic effect, here expressed by the arrow from *X* to *Y*, can be accomplished by stratifying on the founder genotypes 𝒮. This result follows by application of the d-separation rule (Pearl, 2000), (Robins, 2001) upon noting that the correlation between phenotype *Y* and genotype *X* induced by spurious associations from both measured and unmeasured potential confounders is removed by conditioning on 𝒮, so that only the causal effect remains. In what follows, we will formalize this into the assumption that

*Y*(0) denotes the potential/counterfactual response of a given subject under a regime in which

*X*is set to some reference value 0. Because

*X*cannot have an effect on

*Y*(0), the validity of this assumption follows from standard theory for family-based tests (Rabinowitz and Laird, 2000), (Horvath

*and others*, 2004) that, under Mendelian transmission and in the absence of a genetic effect, potential traits and genotypes are independent conditional on 𝒮 (and

*Z*).

Suppose now that the founder genotypes are incompletely observed so that 𝒮 refers to the sufficient statistic for the founder genotypes (Rabinowitz and Laird, 2000). It then follows from Hoffmann *and others* (2009) that the G×E independence assumption continues to hold. It further follows from the detailed arguments in the supplementary materials of Vansteelandt, Demeo, *and others* (2008) that *Y*(0) continues to be conditionally independent of *X*, given 𝒮 (and also given the environmental covariate, *Z*). This motivates our focus on the structural distribution model

*β*encodes the main genetic effect and

*γ*the G×E interaction. Under model (2.2), (2.1) further implies that

*and others*, 1992), (Vansteelandt, Demeo,

*and others*, 2008) exploits this independence restriction by obtaining a consistent estimator for

*β*and

*γ*as the values that solve an estimating equation of the form

*and others*(2008), does not require specification of the form of effects not of direct interest, that is

*Z*and 𝒮.

### Introducing ascertainment criteria

Previous results are no longer valid in the presence of ascertainment (unless for testing the overall genetic effect null hypothesis). This can be seen by adding the ascertainment node *A* to the original causal diagram (see Figure 2 (a) which includes the situation where families are selected based on a proband's phenotype) and noting that the analysis is, by design, conditional on *A*. Indeed, stratifying on the “child” *A* of a collider *Y* modifies the association between 𝒮 and *X*. It follows that

*A*(see also Section 2.4).

A strategy to address this problem, suggested to us by James Robins, is to evaluate the estimating equation (2.3) only in subjects who would have been ascertained regardless of genotype. Related ideas have been used in G-estimation for structural nested failure time models (Robins and Tsiatis, 1991), (Robins, 2002), where the observed event times are typically artificially “recensored.” To be precise, let *A*(*x*) denote the counterfactual ascertainment status of a given subject under a regime in which *X* is set to *x*. Then we will evaluate the estimating equation only in subjects with *A*(*x*) = 1 for all *x*. This is informally depicted in Figure 2 (b), where *A*^{*} = 1 for subjects with *A*(*x*) = 1 for all *x* (0 otherwise); from Figure 2 (b), we can see that *Y*(0) ╨*X*|*Z*,*𝒮*,*A*^{*} = 1, *X* ╨*Z*|*𝒮*,*A*^{*} = 1, and *A*^{*}╨*X*|*𝒮*, suggesting that the previous G-estimation principle remains applicable to this subset. A more formal development, which shows that application of G-estimation to this subset continues to infer the population genetic effect, is given in the next section.

### G×E interaction testing

Throughout, we will consider an outcome-dependent ascertainment criterion such that families are selected for a study based on a single proband within the family having an extreme phenotype. Specifically, a proband is randomly chosen and the corresponding family is selected if and only if that proband's phenotype is below some known value *y*, so that individuals with phenotypes below *y* are identified and family members of these probands are then also recruited. This strategy is often part of a study's eligibility criteria, for example using methacholine reactivity for asthma (Childhood Asthma Management Program, 1999), albumin–creatinine ratio for diabetic nephropathy (Mueller *and others*, 2006), and body mass index for obesity (Hinney *and others*, 1997). As an example, in the COPD study motivating the methodology, single probands with severe COPD were identified and, subsequently, all available relatives of this single proband were recruited such that the entire family's recruitment is solely contingent on the original proband.

For notational convenience, we will identify the selected proband with subject index *j* = 1 and define the family ascertainment indicator to be *A*_{i} = *f*(*Y*_{i1}) = *I*(*Y*_{i1} ≤ *y*). Under the null hypothesis of no G×E interaction (i.e. *γ* = 0), we then propose to estimate the main genetic effect β by solving an equation of the form

*β*, where we define

*A*

_{ij}

^{*}(

*β*) = 1 if

*f*{

*Y*

_{ij}+

*β*(

*x*−

*X*

_{ij})} = 1∀

*x*and 0 otherwise, and due to the single proband ascertainment schema considered here,

*A*

_{i}

^{*}(

*β*) =

*A*

_{i1}

^{*}(

*β*) (see supplementary material available at http://www.biostatistics.oxfordjournals.org for alternative definitions of

*A*

_{i}

^{*}(

*β*)).

Note that *A*_{i}^{*}(*β*) = 1 for family *i* implies in particular that *A*_{i} = 1, so that this estimating equation is effectively selecting a subset of the observed ascertained sample. The validity of this approach requires the following generalization of assumption (2.1):

*γ*= 0, (2.4) is a mean-zero estimating equation because

*I*{

*A*

_{i}

^{*}(

*β*) = 1}(

*Y*

_{ij}−

*β*

*X*

_{ij}) is a function of {

*Y*

_{ij}−

*β*

*X*

_{ij};

*j*= 1,…,

*m*

_{i}}, which is independent of

*X*

_{ij}, conditional on

*Z*

_{ij}and

*𝒮*

_{i}by (2.5) and because

*E*(

*X*

_{ij}|

*Z*

_{ij},

*𝒮*

_{i}) =

*E*(

*X*

_{ij}|

*𝒮*

_{i}). We can then evaluate the following score test statistic for G×E interaction, bad hbox(

*β*):

*β*replaced with $\beta ^$, the solution to (2.4). Note that we choose to focus on score tests because it enables setting

*γ*= 0, which is computationally attractive and tends to limit the deletion of families from the analysis.

Because the indicator functions *I*{*A*_{i}^{*}(*β*) = 1} depend on the genetic effect size, the estimating function for *β*, that is

*β*. The usual Taylor series arguments for M-estimators, which are needed to adjust the G×E interaction test for the uncertainty regarding $\beta ^$, are no longer applicable. Assuming that

_{γβ}and Γ

_{ββ}, we have that

_{γβ}and Γ

_{ββ}can be estimated as the empirical means of −

*Z*

_{ij}

*I*{

*A*

_{i}

^{*}(

*β*) = 1}

*X*

_{ij}{

*X*

_{ij}−

*E*(

*X*

_{ij}|

*𝒮*

_{i})} and −

*I*{

*A*

_{i}

^{*}(

*β*) = 1}

*X*

_{ij}{

*X*

_{ij}−

*E*(

*X*

_{ij}|

*𝒮*

_{i})}, respectively, evaluated at $\beta ^$. More generally, one may use the least squares–based resampling approach proposed by Zeng and Lin (2008), which we adopted in the simulation studies and data analysis. Briefly, this involves assessing how fluctuations in $\beta ^$ induce fluctuations in the scores $Uij,\beta (\beta ^)$ and $Uij,\gamma (\beta ^)$, and then using least squares regression to summarize these associations.

*B*realizations from a random, mean zero vector are generated and denoted by

*Z*

_{1},…,

*Z*

_{B}. $\Gamma ^\beta \beta $ is then calculated as the least squares estimate from regressing $n\u22121/2Uij,\beta (\beta ^+n\u22121/2Zb)(b=1,\u2026,B)$ on

*Z*

_{b}(

*b*= 1,…,

*B*). Similarly, $\Gamma ^\gamma \beta $ is the least squares estimate from regressing $n\u22121/2Uij,\gamma (\beta ^+n\u22121/2Zb)(b=1,\u2026,B)$ on

*Z*

_{b}(

*b*= 1,…,

*B*). From (2.7), the asymptotic variance of the test statistic $n\u22121/2\u2211ijUij,\gamma (\beta ^)$ can now be evaluated as the empirical variance of $Uij,\gamma (\beta ^)\u2212\Gamma ^\gamma \beta \Gamma ^\beta \beta \u22121Uij,\beta (\beta ^)$ with $\beta ^,\Gamma ^\gamma \beta $ and $\Gamma ^\beta \beta $ held fixed. We now have a test of G×E interaction by comparing the ratio of $n\u22121/2\u2211ijUij,\gamma (\beta ^)$ and its standard deviation to a standard normal distribution. This test avoids distributional assumptions while taking into account the imposed ascertainment conditions.

## SIMULATION STUDIES

In each replicate of the simulation study (10 000 replicates per parameter combination), we simulate a single nucleotide polymorphism (SNP) and an environmental covariate for 2000 trios by first simulating parental genotypes and assuming Mendelian transmissions to the proband. The offspring trait, *Y*, is generated based on the model: *Y*_{i} = *β*_{X}*X*_{i} + *Z*_{i} + *β*_{X×Z}(*X*_{i}*Z*_{i}) + ε_{i}, where *X*_{i}∈{0,1,2} (reflecting an additive MOI), *Z*_{i}∼*N*(0,1) and ε_{i}∼*N*(0,1). The effect sizes are then calculated based on the specified heritability for the main effect of genotype (*h*_{X}^{2}) and the G×E effect (*h*_{X×Z}^{2}). Simulated traits are compared against an ascertainment rule (e.g. *Y* > *y*_{0.90}, where *y*_{p} for *p*∈[0,1] denotes the *p*100% percentile of the distribution of *Y*), and this is repeated until 2000 “ascertained” trios have been generated. Figures 3(a)–4(b) display results from scenarios with various minor allele frequencies (MAFs; *p* = 5–50%), ascertainment rules (*Y* > *y*_{a} with *a* = 0.50 and 0.90) and heritabilities (*h*_{X}^{2} & *h*_{X×Z}^{2}, each either 0 or 2.5%). Note that the notion of “extreme phenotype” in the simulations defines ascertainment through the upper tail of the trait distribution. The performance of the proposed method is independent of the direction of the ascertainment condition. For each ascertainment rule examined, the pair of supporting figures consists of (a) type I error rates (Figure 3) for the test of G×E interaction (i.e. under the interaction null of *h*_{X×Z}^{2} = 0), first without a main effect (*h*_{X}^{2} = 0) and then in the presence of a main genetic effect (*h*_{X}^{2} = 2.5%) and (b) powers (Figure 4) assuming *h*_{X×Z}^{2} = 2.5%, again displaying panels both with and without a main genetic effect.

Studies recruiting only probands with traits above the population median (i.e. using an ascertainment rule of *Y* > *y*_{0.50}) incur a mild inflation of type I error when a genetic main effect is present when employing QBAT-I (Vansteelandt, Demeo, *and others*, 2008), the test statistic that ignores the ascertainment conditions (Figure 3(a)). This inflation is most pronounced at lower MAFs. For example, the most severe inflation observed in the simulations was for a SNP with a MAF of 5% (type I error rate of 7.1% using a nominal *α* = 5%). Conversely, *S*_{G×E} is mildly conservative when a main genetic effect is present. Both test statistics maintain the nominal significance level under the complete null hypothesis of no genetic effect. Under this ascertainment schema, however, QBAT-I is uniformly more powerful than *S*_{G×E} (although powers are difficult to compare when type I error rates are inflated) and, other than at low MAF, power is a decreasing function of MAF (Figure 4(a)).

When trait ascertainment is more stringent, the consequences of failing to properly account for it are more evident. Figures 3(b) and 4(b) display the type I error rates and powers under an ascertainment rule of *Y* > *y*_{0.90}. Here, we see that both tests preserve the nominal significance under the complete null hypothesis; however, when there is a main genetic effect, the bias observed using QBAT-I is quite pronounced, here increasing as MAF does, while *S*_{G×E} remains mildly conservative (Figure 3(b)). Interestingly, other than with rare SNPs and in the presence of a main genetic effect, *S*_{bad hbox} is consistently more powerful than QBAT-I, and confers a 30% improvement in power for many MAFs. This is surprising in the absence of a main genetic effect, in which case also QBAT-I is valid. As with median-based ascertainment, power is a decreasing function of MAF; this suggests that ascertainment strategies can be more useful when applied to rarer SNPs. Also of note is that, curiously, when examining a SNP with a main effect and a MAF of 20% or above, the empirical rejection rates for QBAT-I are actually higher under the interaction null hypothesis (*h*_{X×Z}^{2} = 0) than the alternative (*h*_{X×Z}^{2} = 2.5%). This clearly shows the importance of employing an analytic strategy that is appropriate for the given study design.

### Comparison to a likelihood-based approach

We additionally examine a strategy that posits a truncated normal distribution for the offspring trait. Maximum likelihood estimates are calculated from the model *Y*_{i} = *β*_{0} + *β*_{X}*X*_{i} + *β*_{EX}*E*(*X*_{i}|*𝒮*_{i}) + *β*_{Z}*Z*_{i} + *β*_{X×Z}(*X*_{i}*Z*_{i}) + ε_{i} conditional on *Y*_{i} > *y*_{0.90}. A standard test for *H*_{0}:*β*_{X×Z} = 0 is then used. We find that the likelihood approach confers more power than the proposed method but, similar to QBAT-I, suffers from inflated type I error under a variety of scenarios. This bias is likely finite-sample bias due to the difficulties inherent in using only a tail of the normal distribution to estimate the rest of the distribution. See Section S1.1 of the supplementary material (available at Biostatistics online) for more detailed simulation results.

### Introducing misspecification

In all above simulations, we assume the data-generating mechanism to be known; however, it is often the case that components of the mechanism are not known. We consider the effects of (i) misspecifying the MOI, for example testing with the assumption of an additive MOI when in fact the genetic effect is dominant, (ii) misspecifying the offspring trait distribution, and (iii) misspecifying the form of the interaction effect. To assess the effects of a misspecified MOI, we simulate as before and vary the true MOI and that assumed for testing. Our proposed approach remains valid except when data are generated with a recessive MOI and the minor allele is rare (e.g. 5%). Empirical power is not greatly affected when testing with a dominant MOI when the true MOI is additive and vice versa; however, testing assuming a recessive MOI results in much loss of power unless the true MOI is in fact recessive. Results are provided in Section S1.2.1 of the supplementary material (available at Biostatistics online).

We also simulate scenarios with offspring trait errors generated from a Gamma(1,1) distribution (i.e. ε_{i}∼Γ(1,1)) and others where the environmental covariate enters the interaction as a squared term, *β*_{X×Z}(*X*_{i}*Z*_{i}^{2}). Our approach remains valid under these types of specification, and bias is amplified for both the QBAT-I and likelihood approaches. These results are displayed in Section S1.2.2 in the supplementary material (available at Biostatistics online).

## APPLICATION TO A CHRONIC OBSTRUCTIVE PULMONARY DISEASE STUDY

As in Vansteelandt, Demeo, *and others* (2008), we examine the Boston Early-Onset COPD Study of Silverman *and others* (1998). We test the same 6 SNPs located in the *SERPINE2* gene previously found to be in a linkage peak for COPD (Demeo *and others*, 2006). The study consists of 128 extended pedigrees with a maximum family size of 27 members, each ascertained from a single proband satisfying (i) forced expiratory volume at 1 s (FEV_{1}) less than 40% of predicted, (ii) less than 53 years of age, and (iii) no evidence of severe alpha 1-antitrypsin deficiency. Because the study's primary phenotype is FEV_{1}, we apply the new G×E interaction testing strategy as it relates to the first ascertainment criterion because the remaining ascertainment conditions relate to baseline covariates and can henceforth be ignored (when the model is stated conditional on these covariates). That is to estimate the main genetic effect of FEV_{1}, we place the restriction that the proband from each pedigree have FEV_{1} < 40% of predicted regardless of the proband's true genotype as detailed in Section 2.4. The remainder of each proband's extended family is included in the analysis only when this criterion is satisfied.

Smoking is the main risk factor for COPD. As such, it is the environmental covariate investigated in the G×E interaction analysis. Two measures of smoking are used here: a binary version that differentiates subjects who have ever smoked before (ever-smokers) from those who have not (never-smokers); and a continuous version that serves as a proxy for lifetime exposure, pack years (one pack year is defined as smoking one pack of cigarettes per day for 1 year). Consistent with previous analyses of this data, we assume a dominant MOI for each SNP and a null hypothesis of linkage and no association as the tested SNPs were previously found to be under a linkage peak. There is no evidence to support violation of the G×E conditional independence assumption in these data, and an ongoing study corroborates this by finding no nominally significant associations between the investigated SNPs and various smoking covariates (data not shown).

Table 1 displays the results of this analysis employing both the new test, *S*_{bad hbox}, and QBAT-I (Vansteelandt, Demeo, *and others*, 2008). For each SNP, the *p* values are given for both tests. The new approach yields less significant results from ser8, but a potentially more powerful test when considering ser51 and the quantitative phenotype pack years (*S*_{bad hbox}-*p* = 0.047 vs. QBAT-I-*p* = 0.082). These findings are consistent with those of Vansteelandt, Demeo, *and others* (2008) that indicated a haplotype-smoking interaction for haplotypes containing the ser51 SNP and also that the quantitative phenotype pack years is more powerful than its binary counterpart.

Markers | Pack years | Ever-smoker | ||

S_{G × E}-p | QBAT-I-p | S_{G × E}-p | QBAT-I-p | |

ser37 | 0.118 | 0.126 | 0.152 | 0.163 |

ser8 | 0.662 | 0.243 | 0.951 | 0.958 |

ser51 | 0.047 | 0.082 | 0.455 | 0.816 |

ser55 | 0.209 | 0.244 | 0.703 | 0.774 |

ser50 | 0.213 | 0.271 | 0.745 | 0.825 |

ser6 | 0.232 | 0.249 | 0.870 | 0.941 |

Markers | Pack years | Ever-smoker | ||

S_{G × E}-p | QBAT-I-p | S_{G × E}-p | QBAT-I-p | |

ser37 | 0.118 | 0.126 | 0.152 | 0.163 |

ser8 | 0.662 | 0.243 | 0.951 | 0.958 |

ser51 | 0.047 | 0.082 | 0.455 | 0.816 |

ser55 | 0.209 | 0.244 | 0.703 | 0.774 |

ser50 | 0.213 | 0.271 | 0.745 | 0.825 |

ser6 | 0.232 | 0.249 | 0.870 | 0.941 |

## DISCUSSION

The current era of large-scale genetics has seen many successes from conducting genome-wide association studies (GWAS) of complex diseases; to date, novel SNPs for well over 100 disease traits have been discovered using this approach (Hindorff *and others*, 2010). The so-called problem of “missing heritability,” however, has caused some to criticize GWAS (Goldstein, 2009). Others have been more optimistic and have addressed the manners in which current and future GWAS can be used to uncover more of the genetic predisposition for complex diseases (McCarthy and Hirschhorn, 2008), (Manolio *and others*, 2009); not the least among the suggested strategies are using family samples and exploring G×G and G×E interactions. It is integral that appropriate, powerful, and unbiased statistical methodologies be designed for detecting G×E interaction with quantitative disease traits in this context of large-scale analysis as well as in smaller studies.

We have proposed a novel test for G×E interaction that does not require any modeling assumptions for the phenotype distribution, appropriately accounts for outcome-dependent ascertainment, is robust to confounding due to population substructure, and allows for incomplete parental genotypes. Our results support the use of this method for any family study investigating G×E interactions that employs an ascertainment criterion based on a quantitative trait provided that there is support for the G×E conditional independence assumption (or, at the least, no strong evidence to that it fails to hold). If this assumption is implausible, one could in principle derive the distribution of *X*|*Z*,*𝒮* utilizing the known law of *X*|*𝒮* and an application of Bayes' rule in order to modify our approach. However, this would require modeling assumptions that have been intentionally avoided in the proposed method. By focusing on estimation of a scalar parameter, we additionally avoid problems associated with the potential for identifying multiple solutions (Joffe *and others*, 2011).

A potential limitation of the proposed approach is that when the genotype coding can take on many levels (e.g. when *X* refers to a haplotype coding), then the set of families who would have been ascertained regardless of their genotype may be small, in which case the proposed test might have very limited power. This can be partially remedied by restricting the focus to a smaller set *K*_{x} with bounded support, containing a subset of the observed genotypes (e.g. haplotypes satisfying some threshold of observed frequency). In particular, we can redefine *A*_{ij}^{*}(*β*) = 1 if *f*{*Y*_{ij} + *β*(*x* − *X*_{ij})} = 1∀*x*∈*K*_{x} and 0 otherwise, and modify the estimating equation (2.4) for *β* to

## SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

## FUNDING

National Institutes of Health [NCRR] (P20RR020145 and 5P20RR016481-10 to D.W.F.); IAP research network (P06/03 from the Belgian government [Belgian Science Policy] and Ghent University [Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks”] to S.V.). *Conflict of Interest:* None declared.

We are very grateful to Jamie Robins for suggesting the potential use of artificial censoring and for many useful discussions and suggestions. We also thank two anonymous reviewers for insightful comments that improved the manuscript.

## References

*SERPINE2*gene is associated with chronic obstructive pulmonary disease