## Abstract

We propose a method for testing gene–environment (G × E) interactions on a complex trait in family-based studies in which a phenotypic ascertainment criterion has been imposed. This novel approach employs G-estimation, a semiparametric estimation technique from the causal inference literature, to avoid modeling of the association between the environmental exposure and the phenotype, to gain robustness against unmeasured confounding due to population substructure, and to acknowledge the ascertainment conditions. The proposed test allows for incomplete parental genotypes. It is compared by simulation studies to an analogous conditional likelihood–based approach and to the QBAT-I test, which also invokes the G-estimation principle but ignores ascertainment. We apply our approach to a study of chronic obstructive pulmonary disorder.

## INTRODUCTION

It is well accepted that the interplay of genetic and environmental factors can form an important part of the biological foundation of common diseases. Some examples of gene–environment (G × E) interactions include the N-acetyltransferase 2 (NAT2) gene with smoking in bladder cancer (García-Closas and others, 2005) and with red meat consumption in colorectal cancer (Chan and others, 2005), the alcohol dehydrogenase type 3 (ADH3) gene with moderate alcohol consumption in myocardial infarction (Hines and others, 2001) and the serine proteinase inhibitor clade E, member 2 gene (SERPINE2) with smoking in chronic obstructive pulmonary disorder (COPD) (Silverman, 2006). Detecting, and more importantly, understanding these G×E interactions thus play a crucial role in solving the puzzle of gene-influenced phenotypic variation. Several study designs have been proposed for discovering these types of associations, and the designs can be broadly characterized as either population- or family-based depending on whether or not study subjects are related.

When considering designs consisting of related subjects, the class of family-based association tests (FBATs; Laird and others, 2000) provides strategies for testing genetic effects that are robust to undetected/unaccounted for population substructure. Conditioning on parental/founder genotypes (or the corresponding sufficient statistics when parental genotypes are missing; Rabinowitz and Laird, 2000) insulates these testing strategies from the bias due to ancestry-driven confounding. By relying on a priori knowledge regarding the distribution of the offspring genotypes conditional on the parental genotypes, these testing strategies further remain valid when ascertainment conditions are imposed for sample recruitment. This property is lost once a main genetic effect must be estimated, as in the case of testing for G × E or gene–gene (G × G) interactions. Current methodologies assume either (i) no ascertainment condition (i.e. a population sample) (Umbach and Weinberg, 2000), (Yang and others, 2000), (Vansteelandt, VanderWeele, and others, 2008), (ii) a dichotomous trait of interest (Yang and others, 1999), (Liu and others, 2002), (Cordell and others, 2004), (Lake and Laird, 2004), (Whittemore, 2004), (Chatterjee and others, 2005), (Hoffmann and others, 2009), (Moerkerke and others, 2010), (Wang and others, 2011) or (iii) a fully parametric framework (Whittemore, 2004), (Dudbridge, 2008). It is likely that extensions of some of the aforementioned parametric approaches could be powerful alternatives able to handle multiparameter settings when the parametric assumptions are correct. For example, the parental-genotype-robust estimator proposed by Whittemore (2004) could be modified to estimate G×E interaction effects under continuous trait ascertainment schema and, thus, employ a more standard, non-causal-inference–based method.

Our focus here will be on continuous traits. Considering that controlling false-positive rates is a foremost concern in this age of large-scale genetic association studies, we will attempt to avoid parametric assumptions where possible by making use of G-estimation (Robins and others, 1992), (Joffe and Brensinger, 2003), (Greenland and others, 2008). This is a semiparametric estimation method from the causal inference literature, which underlies the class of FBATs (Vansteelandt, Demeo, and others, 2008). In particular, adapting ideas on artificial censoring in G-estimation for structural nested failure time models (Robins and Tsiatis, 1991), (Robins, 2002), we propose a test that maintains robustness to population stratification and correctly accounts for sample ascertainment, while avoiding assumptions on the phenotypic and allele frequency distributions. We assess the performance of this test empirically through extensive simulation studies. To illustrate the approach in practice, we also apply these new techniques to a study of COPD (Demeo and others, 2006).

## METHODS

### Notation

We consider a family-based genetic association study and introduce the following notation: Xij is a function of the genotype from the jth member of the ith family (i = 1,…,n, where n is the number of independent families; j = 1,…,mi, where mi is the number of nonfounders in the ith pedigree) coded to reflect some particular mode of inheritance (MOI) (e.g. for an assumed additive MOI, Xij is the count of risk alleles from the jth member of the ith family); 𝒮i is the sufficient statistic for the founder genotypes of the ith family (see Rabinowitz and Laird, 2000,); Yij is the disease phenotype of interest; Zij is an environmental covariate; and Ai is a hypothetical ascertainment indicator which is 1 for families that have been ascertained and 0 otherwise.

### Model

Figure 1 graphically illustrates the causal assumptions for a typical family-based association study using a directed acyclic graph (DAG) (Pearl, 1995), (Pearl, 2000), (Robins, 2001). It encodes, with arrows, all possible direct causal paths between the variables considered in the study. Let us first assume that all parental genotypes have been observed so that 𝒮 encodes the founder genotypes. While we allow for measured and unmeasured covariates Z and U, respectively, to affect (or be associated with) the founder genotypes, 𝒮, and the disease phenotype, Y, directly, the DAG assumes that they can affect the nonfounder genotype, X, only through 𝒮. This particular assumption underlies the validity of the class of FBATs, even in the presence of population stratification (FBATs; Rabinowitz and Laird, 2000). Note furthermore that the DAG embodies the common G×E independence assumption XZ|𝒮 (Umbach and Weinberg, 1997), which is often deemed plausible, but has nevertheless been found to be violated in specific studies (Chanock and Hunter, 2008) and should be critically assessed whenever applied.

Fig. 1.

DAG displaying the causal assumptions common for family-based genetic association study. Each arrow denotes a possible causal path. Nodes are defined as U, unmeasured covariates; Z, measured covariates; S, the sufficient statistics for the founder genotypes; X, the offspring genotypes; and Y, the disease phenotype.

Fig. 1.

DAG displaying the causal assumptions common for family-based genetic association study. Each arrow denotes a possible causal path. Nodes are defined as U, unmeasured covariates; Z, measured covariates; S, the sufficient statistics for the founder genotypes; X, the offspring genotypes; and Y, the disease phenotype.

Under these assumptions, estimating the overall genetic effect, here expressed by the arrow from X to Y, can be accomplished by stratifying on the founder genotypes 𝒮. This result follows by application of the d-separation rule (Pearl, 2000), (Robins, 2001) upon noting that the correlation between phenotype Y and genotype X induced by spurious associations from both measured and unmeasured potential confounders is removed by conditioning on 𝒮, so that only the causal effect remains. In what follows, we will formalize this into the assumption that

(2.1)
where Y(0) denotes the potential/counterfactual response of a given subject under a regime in which X is set to some reference value 0. Because X cannot have an effect on Y(0), the validity of this assumption follows from standard theory for family-based tests (Rabinowitz and Laird, 2000), (Horvath and others, 2004) that, under Mendelian transmission and in the absence of a genetic effect, potential traits and genotypes are independent conditional on 𝒮 (and Z).

Suppose now that the founder genotypes are incompletely observed so that 𝒮 refers to the sufficient statistic for the founder genotypes (Rabinowitz and Laird, 2000). It then follows from Hoffmann and others (2009) that the G×E independence assumption continues to hold. It further follows from the detailed arguments in the supplementary materials of Vansteelandt, Demeo, and others (2008) that Y(0) continues to be conditionally independent of X, given 𝒮 (and also given the environmental covariate, Z). This motivates our focus on the structural distribution model

(2.2)
Under this model, we have in particular that
and, by (2.1), that
from which it is clear that β encodes the main genetic effect and γ the G×E interaction. Under model (2.2), (2.1) further implies that
G-estimation (Robins and others, 1992), (Vansteelandt, Demeo, and others, 2008) exploits this independence restriction by obtaining a consistent estimator for β and γ as the values that solve an estimating equation of the form
(2.3)
It is noteworthy that this approach, which is equivalent to that in Vansteelandt, Demeo, and others (2008), does not require specification of the form of effects not of direct interest, that is Z and 𝒮.

### Introducing ascertainment criteria

Previous results are no longer valid in the presence of ascertainment (unless for testing the overall genetic effect null hypothesis). This can be seen by adding the ascertainment node A to the original causal diagram (see Figure 2 (a) which includes the situation where families are selected based on a proband's phenotype) and noting that the analysis is, by design, conditional on A. Indeed, stratifying on the “child” A of a collider Y modifies the association between 𝒮 and X. It follows that

thus that we can no longer use Mendel's law of segregation when the previous estimation strategy is restricted to ascertained individuals only and thus that we cannot immediately rely on Mendel's law of segregation when the analysis is restricted to ascertained individuals only. It is also seen from Figure 2 (a) that the no confounding assumption 2.1 will generally fail to hold conditional on A (see also Section 2.4).

Fig. 2.

DAGs incorporating ascertainment criterion into the usual family-based genetic association study. We denote the nodes as U, unmeasured covariates; Z, measured covariates; S, the sufficient statistics for the founder genotypes; {Y(x)}, the set of counterfactual disease traits under regimes when X is set to x, for all x; A = f(Y) is the outcome-based ascertainment criterion; A* is the counterfactual-based ascertainment criterion (A* = 1 denotes the subjects who would have been ascertained regardless of genotype); X, the offspring genotypes; and Y, the observed disease phenotype.

Fig. 2.

DAGs incorporating ascertainment criterion into the usual family-based genetic association study. We denote the nodes as U, unmeasured covariates; Z, measured covariates; S, the sufficient statistics for the founder genotypes; {Y(x)}, the set of counterfactual disease traits under regimes when X is set to x, for all x; A = f(Y) is the outcome-based ascertainment criterion; A* is the counterfactual-based ascertainment criterion (A* = 1 denotes the subjects who would have been ascertained regardless of genotype); X, the offspring genotypes; and Y, the observed disease phenotype.

A strategy to address this problem, suggested to us by James Robins, is to evaluate the estimating equation (2.3) only in subjects who would have been ascertained regardless of genotype. Related ideas have been used in G-estimation for structural nested failure time models (Robins and Tsiatis, 1991), (Robins, 2002), where the observed event times are typically artificially “recensored.” To be precise, let A(x) denote the counterfactual ascertainment status of a given subject under a regime in which X is set to x. Then we will evaluate the estimating equation only in subjects with A(x) = 1 for all x. This is informally depicted in Figure 2 (b), where A* = 1 for subjects with A(x) = 1 for all x (0 otherwise); from Figure 2 (b), we can see that Y(0) ╨X|Z,𝒮,A* = 1, XZ|𝒮,A* = 1, and A*X|𝒮, suggesting that the previous G-estimation principle remains applicable to this subset. A more formal development, which shows that application of G-estimation to this subset continues to infer the population genetic effect, is given in the next section.

### G×E interaction testing

Throughout, we will consider an outcome-dependent ascertainment criterion such that families are selected for a study based on a single proband within the family having an extreme phenotype. Specifically, a proband is randomly chosen and the corresponding family is selected if and only if that proband's phenotype is below some known value y, so that individuals with phenotypes below y are identified and family members of these probands are then also recruited. This strategy is often part of a study's eligibility criteria, for example using methacholine reactivity for asthma (Childhood Asthma Management Program, 1999), albumin–creatinine ratio for diabetic nephropathy (Mueller and others, 2006), and body mass index for obesity (Hinney and others, 1997). As an example, in the COPD study motivating the methodology, single probands with severe COPD were identified and, subsequently, all available relatives of this single proband were recruited such that the entire family's recruitment is solely contingent on the original proband.

For notational convenience, we will identify the selected proband with subject index j = 1 and define the family ascertainment indicator to be Ai = f(Yi1) = I(Yi1y). Under the null hypothesis of no G×E interaction (i.e. γ = 0), we then propose to estimate the main genetic effect β by solving an equation of the form

(2.4)
for β, where we define Aij*(β) = 1 if f{Yij + β(xXij)} = 1∀x and 0 otherwise, and due to the single proband ascertainment schema considered here, Ai*(β) = Ai1*(β) (see supplementary material available at http://www.biostatistics.oxfordjournals.org for alternative definitions of Ai*(β)).

Note that Ai*(β) = 1 for family i implies in particular that Ai = 1, so that this estimating equation is effectively selecting a subset of the observed ascertained sample. The validity of this approach requires the following generalization of assumption (2.1):

(2.5)
which is plausible when (2.1) holds and a subject's genotype does not affect the phenotype of other family members. Indeed, under this strengthened assumption and model (2.2) with γ = 0, (2.4) is a mean-zero estimating equation because I{Ai*(β) = 1}(YijβXij) is a function of {YijβXij;j = 1,…,mi}, which is independent of Xij, conditional on Zij and 𝒮i by (2.5) and because E(Xij|Zij,𝒮i) = E(Xij|𝒮i). We can then evaluate the following score test statistic for G×E interaction, bad hbox(β):
(2.6)
which has mean zero under the null hypothesis of no G×E interaction. In practice, we evaluate it with β replaced with $β^$, the solution to (2.4). Note that we choose to focus on score tests because it enables setting γ = 0, which is computationally attractive and tends to limit the deletion of families from the analysis.

Because the indicator functions I{Ai*(β) = 1} depend on the genetic effect size, the estimating function for β, that is

and the score function for the G×E interaction test, that is
are nonsmooth in β. The usual Taylor series arguments for M-estimators, which are needed to adjust the G×E interaction test for the uncertainty regarding $β^$, are no longer applicable. Assuming that
for conformable matrices Γγβ and Γββ, we have that
(2.7)
Here, for absolutely continuous phenotypes, Γγβ and Γββ can be estimated as the empirical means of − ZijI{Ai*(β) = 1}Xij{XijE(Xij|𝒮i)} and − I{Ai*(β) = 1}Xij{XijE(Xij|𝒮i)}, respectively, evaluated at $β^$. More generally, one may use the least squares–based resampling approach proposed by Zeng and Lin (2008), which we adopted in the simulation studies and data analysis. Briefly, this involves assessing how fluctuations in $β^$ induce fluctuations in the scores $Uij,β(β^)$ and $Uij,γ(β^)$, and then using least squares regression to summarize these associations. B realizations from a random, mean zero vector are generated and denoted by Z1,…,ZB. $Γ^ββ$ is then calculated as the least squares estimate from regressing $n−1/2Uij,β(β^+n−1/2Zb)(b=1,…,B)$ on Zb(b = 1,…,B). Similarly, $Γ^γβ$ is the least squares estimate from regressing $n−1/2Uij,γ(β^+n−1/2Zb)(b=1,…,B)$ on Zb(b = 1,…,B). From (2.7), the asymptotic variance of the test statistic $n−1/2∑ijUij,γ(β^)$ can now be evaluated as the empirical variance of $Uij,γ(β^)−Γ^γβΓ^ββ−1Uij,β(β^)$ with $β^,Γ^γβ$ and $Γ^ββ$ held fixed. We now have a test of G×E interaction by comparing the ratio of $n−1/2∑ijUij,γ(β^)$ and its standard deviation to a standard normal distribution. This test avoids distributional assumptions while taking into account the imposed ascertainment conditions.

## SIMULATION STUDIES

In each replicate of the simulation study (10 000 replicates per parameter combination), we simulate a single nucleotide polymorphism (SNP) and an environmental covariate for 2000 trios by first simulating parental genotypes and assuming Mendelian transmissions to the proband. The offspring trait, Y, is generated based on the model: Yi = βXXi + Zi + βX×Z(XiZi) + εi, where Xi∈{0,1,2} (reflecting an additive MOI), ZiN(0,1) and εiN(0,1). The effect sizes are then calculated based on the specified heritability for the main effect of genotype (hX2) and the G×E effect (hX×Z2). Simulated traits are compared against an ascertainment rule (e.g. Y > y0.90, where yp for p∈[0,1] denotes the p100% percentile of the distribution of Y), and this is repeated until 2000 “ascertained” trios have been generated. Figures 3(a)–4(b) display results from scenarios with various minor allele frequencies (MAFs; p = 5–50%), ascertainment rules (Y > ya with a = 0.50 and 0.90) and heritabilities (hX2 & hX×Z2, each either 0 or 2.5%). Note that the notion of “extreme phenotype” in the simulations defines ascertainment through the upper tail of the trait distribution. The performance of the proposed method is independent of the direction of the ascertainment condition. For each ascertainment rule examined, the pair of supporting figures consists of (a) type I error rates (Figure 3) for the test of G×E interaction (i.e. under the interaction null of hX×Z2 = 0), first without a main effect (hX2 = 0) and then in the presence of a main genetic effect (hX2 = 2.5%) and (b) powers (Figure 4) assuming hX×Z2 = 2.5%, again displaying panels both with and without a main genetic effect.

Fig. 3.

Empirical type I error rates over 10 000 simulations of a study recruiting 2000 trios and employing either a 50% or 90% ascertainment rule.

Fig. 3.

Empirical type I error rates over 10 000 simulations of a study recruiting 2000 trios and employing either a 50% or 90% ascertainment rule.

Figure 4

Empirical power over 10 000 simulations of a study recruiting 2000 trios and employing either a 50% or 90% ascertainment rule.

Figure 4

Empirical power over 10 000 simulations of a study recruiting 2000 trios and employing either a 50% or 90% ascertainment rule.

Studies recruiting only probands with traits above the population median (i.e. using an ascertainment rule of Y > y0.50) incur a mild inflation of type I error when a genetic main effect is present when employing QBAT-I (Vansteelandt, Demeo, and others, 2008), the test statistic that ignores the ascertainment conditions (Figure 3(a)). This inflation is most pronounced at lower MAFs. For example, the most severe inflation observed in the simulations was for a SNP with a MAF of 5% (type I error rate of 7.1% using a nominal α = 5%). Conversely, SG×E is mildly conservative when a main genetic effect is present. Both test statistics maintain the nominal significance level under the complete null hypothesis of no genetic effect. Under this ascertainment schema, however, QBAT-I is uniformly more powerful than SG×E (although powers are difficult to compare when type I error rates are inflated) and, other than at low MAF, power is a decreasing function of MAF (Figure 4(a)).

When trait ascertainment is more stringent, the consequences of failing to properly account for it are more evident. Figures 3(b) and 4(b) display the type I error rates and powers under an ascertainment rule of Y > y0.90. Here, we see that both tests preserve the nominal significance under the complete null hypothesis; however, when there is a main genetic effect, the bias observed using QBAT-I is quite pronounced, here increasing as MAF does, while SG×E remains mildly conservative (Figure 3(b)). Interestingly, other than with rare SNPs and in the presence of a main genetic effect, Sbad hbox is consistently more powerful than QBAT-I, and confers a 30% improvement in power for many MAFs. This is surprising in the absence of a main genetic effect, in which case also QBAT-I is valid. As with median-based ascertainment, power is a decreasing function of MAF; this suggests that ascertainment strategies can be more useful when applied to rarer SNPs. Also of note is that, curiously, when examining a SNP with a main effect and a MAF of 20% or above, the empirical rejection rates for QBAT-I are actually higher under the interaction null hypothesis (hX×Z2 = 0) than the alternative (hX×Z2 = 2.5%). This clearly shows the importance of employing an analytic strategy that is appropriate for the given study design.

### Comparison to a likelihood-based approach

We additionally examine a strategy that posits a truncated normal distribution for the offspring trait. Maximum likelihood estimates are calculated from the model Yi = β0 + βXXi + βEXE(Xi|𝒮i) + βZZi + βX×Z(XiZi) + εi conditional on Yi > y0.90. A standard test for H0:βX×Z = 0 is then used. We find that the likelihood approach confers more power than the proposed method but, similar to QBAT-I, suffers from inflated type I error under a variety of scenarios. This bias is likely finite-sample bias due to the difficulties inherent in using only a tail of the normal distribution to estimate the rest of the distribution. See Section S1.1 of the supplementary material (available at Biostatistics online) for more detailed simulation results.

### Introducing misspecification

In all above simulations, we assume the data-generating mechanism to be known; however, it is often the case that components of the mechanism are not known. We consider the effects of (i) misspecifying the MOI, for example testing with the assumption of an additive MOI when in fact the genetic effect is dominant, (ii) misspecifying the offspring trait distribution, and (iii) misspecifying the form of the interaction effect. To assess the effects of a misspecified MOI, we simulate as before and vary the true MOI and that assumed for testing. Our proposed approach remains valid except when data are generated with a recessive MOI and the minor allele is rare (e.g. 5%). Empirical power is not greatly affected when testing with a dominant MOI when the true MOI is additive and vice versa; however, testing assuming a recessive MOI results in much loss of power unless the true MOI is in fact recessive. Results are provided in Section S1.2.1 of the supplementary material (available at Biostatistics online).

We also simulate scenarios with offspring trait errors generated from a Gamma(1,1) distribution (i.e. εi∼Γ(1,1)) and others where the environmental covariate enters the interaction as a squared term, βX×Z(XiZi2). Our approach remains valid under these types of specification, and bias is amplified for both the QBAT-I and likelihood approaches. These results are displayed in Section S1.2.2 in the supplementary material (available at Biostatistics online).

## APPLICATION TO A CHRONIC OBSTRUCTIVE PULMONARY DISEASE STUDY

As in Vansteelandt, Demeo, and others (2008), we examine the Boston Early-Onset COPD Study of Silverman and others (1998). We test the same 6 SNPs located in the SERPINE2 gene previously found to be in a linkage peak for COPD (Demeo and others, 2006). The study consists of 128 extended pedigrees with a maximum family size of 27 members, each ascertained from a single proband satisfying (i) forced expiratory volume at 1 s (FEV1) less than 40% of predicted, (ii) less than 53 years of age, and (iii) no evidence of severe alpha 1-antitrypsin deficiency. Because the study's primary phenotype is FEV1, we apply the new G×E interaction testing strategy as it relates to the first ascertainment criterion because the remaining ascertainment conditions relate to baseline covariates and can henceforth be ignored (when the model is stated conditional on these covariates). That is to estimate the main genetic effect of FEV1, we place the restriction that the proband from each pedigree have FEV1 < 40% of predicted regardless of the proband's true genotype as detailed in Section 2.4. The remainder of each proband's extended family is included in the analysis only when this criterion is satisfied.

Smoking is the main risk factor for COPD. As such, it is the environmental covariate investigated in the G×E interaction analysis. Two measures of smoking are used here: a binary version that differentiates subjects who have ever smoked before (ever-smokers) from those who have not (never-smokers); and a continuous version that serves as a proxy for lifetime exposure, pack years (one pack year is defined as smoking one pack of cigarettes per day for 1 year). Consistent with previous analyses of this data, we assume a dominant MOI for each SNP and a null hypothesis of linkage and no association as the tested SNPs were previously found to be under a linkage peak. There is no evidence to support violation of the G×E conditional independence assumption in these data, and an ongoing study corroborates this by finding no nominally significant associations between the investigated SNPs and various smoking covariates (data not shown).

Table 1 displays the results of this analysis employing both the new test, Sbad hbox, and QBAT-I (Vansteelandt, Demeo, and others, 2008). For each SNP, the p values are given for both tests. The new approach yields less significant results from ser8, but a potentially more powerful test when considering ser51 and the quantitative phenotype pack years (Sbad hbox-p = 0.047 vs. QBAT-I-p = 0.082). These findings are consistent with those of Vansteelandt, Demeo, and others (2008) that indicated a haplotype-smoking interaction for haplotypes containing the ser51 SNP and also that the quantitative phenotype pack years is more powerful than its binary counterpart.

Table 1.

Summary of results for COPD study. Each SNP is assumed to act in a dominant fashion, and all analyses assumed a null hypothesis of linkage and no association. G × E interaction p values from both the proposed test statistic and QBAT-I are presented for the continuous covariate, pack years, and the binary measure, ever-smoker

 Markers Pack years Ever-smoker SG × E-p QBAT-I-p SG × E-p QBAT-I-p ser37 0.118 0.126 0.152 0.163 ser8 0.662 0.243 0.951 0.958 ser51 0.047 0.082 0.455 0.816 ser55 0.209 0.244 0.703 0.774 ser50 0.213 0.271 0.745 0.825 ser6 0.232 0.249 0.870 0.941
 Markers Pack years Ever-smoker SG × E-p QBAT-I-p SG × E-p QBAT-I-p ser37 0.118 0.126 0.152 0.163 ser8 0.662 0.243 0.951 0.958 ser51 0.047 0.082 0.455 0.816 ser55 0.209 0.244 0.703 0.774 ser50 0.213 0.271 0.745 0.825 ser6 0.232 0.249 0.870 0.941

## DISCUSSION

The current era of large-scale genetics has seen many successes from conducting genome-wide association studies (GWAS) of complex diseases; to date, novel SNPs for well over 100 disease traits have been discovered using this approach (Hindorff and others, 2010). The so-called problem of “missing heritability,” however, has caused some to criticize GWAS (Goldstein, 2009). Others have been more optimistic and have addressed the manners in which current and future GWAS can be used to uncover more of the genetic predisposition for complex diseases (McCarthy and Hirschhorn, 2008), (Manolio and others, 2009); not the least among the suggested strategies are using family samples and exploring G×G and G×E interactions. It is integral that appropriate, powerful, and unbiased statistical methodologies be designed for detecting G×E interaction with quantitative disease traits in this context of large-scale analysis as well as in smaller studies.

We have proposed a novel test for G×E interaction that does not require any modeling assumptions for the phenotype distribution, appropriately accounts for outcome-dependent ascertainment, is robust to confounding due to population substructure, and allows for incomplete parental genotypes. Our results support the use of this method for any family study investigating G×E interactions that employs an ascertainment criterion based on a quantitative trait provided that there is support for the G×E conditional independence assumption (or, at the least, no strong evidence to that it fails to hold). If this assumption is implausible, one could in principle derive the distribution of X|Z,𝒮 utilizing the known law of X|𝒮 and an application of Bayes' rule in order to modify our approach. However, this would require modeling assumptions that have been intentionally avoided in the proposed method. By focusing on estimation of a scalar parameter, we additionally avoid problems associated with the potential for identifying multiple solutions (Joffe and others, 2011).

A potential limitation of the proposed approach is that when the genotype coding can take on many levels (e.g. when X refers to a haplotype coding), then the set of families who would have been ascertained regardless of their genotype may be small, in which case the proposed test might have very limited power. This can be partially remedied by restricting the focus to a smaller set Kx with bounded support, containing a subset of the observed genotypes (e.g. haplotypes satisfying some threshold of observed frequency). In particular, we can redefine Aij*(β) = 1 if f{Yij + β(xXij)} = 1∀xKx and 0 otherwise, and modify the estimating equation (2.4) for β to

and the G×E test statistic to
This modification is motivated by the more general results in Robins (Theorem A4.1 2002); its validity is readily verified from the unbiasedness of these equations under the null hypothesis. In future work, we hope to extend the proposed methods to accommodate more general ascertainment conditions that are based not on the tested phenotype but rather a correlated endophenotype.

## SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

## FUNDING

National Institutes of Health [NCRR] (P20RR020145 and 5P20RR016481-10 to D.W.F.); IAP research network (P06/03 from the Belgian government [Belgian Science Policy] and Ghent University [Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks”] to S.V.). Conflict of Interest: None declared.

We are very grateful to Jamie Robins for suggesting the potential use of artificial censoring and for many useful discussions and suggestions. We also thank two anonymous reviewers for insightful comments that improved the manuscript.

## References

Chan
AT
Tranah
GJ
Giovannucci
EL
Willett
WC
Hunter
DJ
Fuchs
CS
Prospective study of n-acetyltransferase-2 genotypes, meat intake, smoking and risk of colorectal cancer
International Journal of Cancer
,
2005
, vol.
115
(pg.
648
-
652
)
Chanock
SJ
Hunter
DJ
Genomics: when the smoke clears.
Nature
,
2008
, vol.
452
(pg.
537
-
538
)
Chatterjee
N
Kalaylioglu
Z
Carroll
RJ
Exploiting gene-environment independence in family-based case-control studies: increased power for detecting associations, interactions and joint effects
Genetic Epidemiology
,
2005
, vol.
28
(pg.
138
-
156
)
Childhood Asthma Management Program
The childhood asthma management program (camp): design, rationale, and methods. Childhood asthma management program research group
Controlled Clinical Trials
,
1999
, vol.
20
(pg.
91
-
120
)
Cordell
HJ
Barratt
BJ
Clayton
DG
Case/pseudocontrol analysis in genetic association studies: A unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions, and parent-of-origin effects
Genetic Epidemiology
,
2004
, vol.
26
(pg.
167
-
185
)
Demeo
DL
Mariani
TJ
Lange
C
Srisuma
S
Litonjua
AA
Celedon
JC
Lake
SL
Reilly
JJ
Chapman
HA
Mecham
BH
and others
The SERPINE2 gene is associated with chronic obstructive pulmonary disease
American Journal of Human Genetics
,
2006
, vol.
78
(pg.
253
-
264
)
Dudbridge
F
Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data
Human Heredity
,
2008
, vol.
66
(pg.
87
-
98
)
García-Closas
M
Malats
N
Silverman
D
Dosemeci
M
Kogevinas
M
Hein
DW
Tardón
A
Serra
C
Carrato
A
García-Closas
R
and others
Nat2 slow acetylation, gstm1 null genotype, and risk of bladder cancer: results from the spanish bladder cancer study and meta-analyses
Lancet
,
2008
, vol.
366
(pg.
649
-
659
)
Goldstein
DB
Common genetic variation and human traits
New England Journal of Medicine
,
2009
, vol.
360
(pg.
1696
-
1698
)
Greenland
S
Lanes
S
Jara
M
Estimating effects from randomized trials with discontinuations: the need for intent-to-treat design and g-estimation
Clinical Trials
,
2008
, vol.
5
(pg.
5
-
13
)
Hindorff
LA
Junkins
HA
Hall
PN
Mehta
JP
Manolio
T A
A catalog of published genome-wide association studies

www.genome.gov/gwastudies (Accessed October 25, 2010)
Hines
LM
Stampfer
MJ
Ma
J
Gaziano
JM
Ridker
PM
Hankinson
SE
Sacks
F
Rimm
EB
Hunter
DJ
Genetic variation in alcohol dehydrogenase and the beneficial effect of moderate alcohol consumption on myocardial infarction
New England Journal of Medicine
,
2001
, vol.
344
(pg.
549
-
555
)
Hinney
A
Lentes
KU
Rosenkranz
K
Barth
N
Roth
H
Ziegler
A
Hennighausen
K
Coners
H
Wurmser
H
Jacob
K
and others
International Journal of Obesity and Related Metabolic Disorders
,
1997
, vol.
21
(pg.
224
-
230
)
Hoffmann
TJ
Lange
C
Vansteelandt
S
Laird
NM
Gene-environment interaction tests for dichotomous traits in trios and sibships
Genetic Epidemiology
,
2009
, vol.
33
(pg.
691
-
699
)
Horvath
S
Xu
X
Lake
SL
Silverman
EK
Weiss
ST
Laird
NM
Family-based tests for associating haplotypes with general phenotype data: application to asthma genetics
Genetic Epidemiology
,
2004
, vol.
26
(pg.
61
-
69
)
Joffe
MM
Brensinger
C
Weighting in instrumental variables and g-estimation
Statistics in Medicine
,
2003
, vol.
22
(pg.
1285
-
1303
)
Joffe
MM
Yang
WP
Feldman
H
G-estimation and artificial censoring: problems, challenges, and applications
Biometrics
,
2011

doi: 10.1111/j.1541-0420.2011.01656.x. Available from http://onlinelibrary.wiley.com/doi/10.1111/j.1541-0420.2011.01656.x/full
Laird
NM
Horvath
S
Xu
X
Implementing a unified approach to family-based tests of association
Genetic Epidemiology
,
2000
, vol.
19

Suppl 1
(pg.
S36
-
S42
)
Lake
SL
Laird
NM
Tests of gene-environment interaction for case-parent triads with general environmental exposures
Annals of Human Genetics
,
2004
, vol.
68
(pg.
55
-
64
)
Liu
Y
Tritchler
D
Bull
SB
A unified framework for transmission-disequilibrium test analysis of discrete and continuous traits
Genetic Epidemiology
,
2002
, vol.
22
(pg.
26
-
40
)
Manolio
TA
Collins
FS
Cox
NJ
Goldstein
DB
Hindorff
LA
Hunter
DJ
Mccarthy
MI
Ramos
EM
Cardon
LR
Chakravarti
A
and others
Finding the missing heritability of complex diseases
Nature
,
2009
, vol.
461
(pg.
747
-
753
)
McCarthy
MI
Hirschhorn
JN
Genome-wide association studies: potential next steps on a genetic journey
Human Molecular Genetics
,
2008
, vol.
17
(pg.
R156
-
R165
)
Moerkerke
B
Vansteelandt
S
Lange
C
A doubly robust test for gene-environment interaction in family-based studies of affected offspring
Biostatistics
,
2010
, vol.
11
(pg.
213
-
225
)
Mueller
PW
Rogus
JJ
Cleary
PA
Zhao
Y
Smiles
AM
Steffes
MW
Bucksa
J
Gibson
TB
SK
Krolewski
AS
and others
Genetics of kidneys in diabetes (gokind) study: a genetics collection available for identifying genetic susceptibility factors for diabetic nephropathy in type 1 diabetes
Journal of the American Society of Nephrology
,
2006
, vol.
17
(pg.
1782
-
1790
)
Pearl
J
Causal diagrams for empirical research
Biometrika
,
1995
, vol.
82
pg.
669

Pearl
J
Causality: Models, Reasoning and Inference.
,
2000
Cambridge, UK
Cambridge University Press
Rabinowitz
D
Laird
N
A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information
Human Heredity
,
2000
, vol.
50
(pg.
211
-
223
)
Robins
JM
Data, design, and background knowledge in etiologic inference
Epidemiology
,
2001
, vol.
12
(pg.
313
-
320
)
Robins
JM
Ostrow
DG
Kessler
R
Analytic methods for estimating HIV treatment and cofactor effects
Methodological Issues of AIDS Mental Health Research
,
2002
, vol.
1993

New York
Plenum Publishing
(pg.
213
-
290
)
Robins
JM
Lin
DY
Heagerty
P
Optimal structural nested models for optimal sequential decisions
Proceedings of the Second Seattle Symposium on Biostatistics Analysis of Correlated Data
,
2004
New York
Springer
(pg.
189
-
326
)
Robins
JM
Mark
SD
Newey
WK
Estimating exposure effects by modelling the expectation of exposure conditional on confounders
Biometrics
,
1992
, vol.
48
(pg.
479
-
495
)
Robins
JM
Tsiatis
AA
Correcting for non-compliance in randomized trials using rank preserving structural failure time models
Communications in Statistics-Theory and Methods
,
1991
, vol.
20
(pg.
2609
-
2631
)
Silverman
EK
Progress in chronic obstructive pulmonary disease genetics
Proceedings of the American Thoracic Society
,
2006
, vol.
3
(pg.
405
-
408
)
Silverman
EK
Chapman
HA
Drazen
JM
Weiss
ST
Rosner
B
Campbell
EJ
O'Donnell
WJ
Reilly
JJ
Ginns
L
Mentzer
S
and others
Genetic epidemiology of severe, early-onset chronic obstructive pulmonary disease. Risk to relatives for airflow obstruction and chronic bronchitis
American Journal of Respiratory and Critical Care Medicine
,
1998
, vol.
157
(pg.
1770
-
1778
)
Umbach
DM
Weinberg
CR
Designing and analysing case-control studies to exploit independence of genotype and exposure
Statistics in Medicine
,
1997
, vol.
16
(pg.
1731
-
1743
)
Umbach
DM
Weinberg
CR
The use of case-parent triads to study joint effects of genotype and exposure
American Journal of Human Genetics
,
2000
, vol.
66
(pg.
251
-
261
)
Vansteelandt
S
Demeo
DL
J
Smoller
JW
Murphy
AJ
McQueen
M
Schneiter
K
Celedon
JC
Weiss
ST
Silverman
EK
and others
Testing and estimating gene-environment interactions in family-based association studies
Biometrics
,
2008
, vol.
64
(pg.
458
-
467
)
Vansteelandt
S
VanderWeele
TJ
Tchetgen
EJ
Robins
JM
Multiply robust inference for statistical interactions
Journal of the American Statistical Association
,
2008
, vol.
103
(pg.
1693
-
1704
)
Wang
Y
Yang
Q
Rabinowitz
D
Unbiased and locally efficient estimation of genetic effect on quantitative trait in the presence of population admixture
Biometrics
,
2011
, vol.
67
(pg.
331
-
343
)
Whittemore
AS
Estimating genetic association parameters from family data
Biometrika
,
2004
, vol.
91
pg.
219

Yang
Q
Khoury
MJ
Sun
F
Flanders
WD
Case-only design to measure gene-gene interaction
Epidemiology
,
1999
, vol.
10
(pg.
167
-
170
)
Yang
Q
Rabinowitz
D
Isasi
C
Shea
S
Adjusting for confounding due to population admixture when estimating the effect of candidate genes on quantitative traits
Human Heredity
,
2000
, vol.
50
(pg.
227
-
233
)
Zeng
D
Lin
DY
Efficient resampling methods for nonsmooth estimating functions
Biostatistics
,
2008
, vol.
9
(pg.
355
-
363
)