Selection bias is a common concern in epidemiologic studies, particularly case-control studies. Selection bias in the odds ratio occurs when participation depends jointly on exposure and disease status. General results for understanding when selection bias may affect studies involving gene-environment interactions have not yet been developed. In this paper, the authors show that the assessment of gene-environment interactions will not be subject to selection bias under the assumption that genotype does not influence participation conditional on exposure and disease status. This is true even when selection, including self-selection of subjects, is jointly influenced by exposure and disease and regardless of whether the genotype is related to exposure, disease, or both. The authors present an example to illustrate this concept.
Received for publication October 16, 2002; accepted for publication February 25, 2003.
In the search for causative agents of human diseases, both environmental and genetic risk factors have been identified. The relative influence of the two is highly variable, yet for most diseases, it is unlikely that purely environmental or purely genetic etiologies will sufficiently explain the observed variability in disease occurrence. One challenge in epidemiologic studies is determining the nature and extent to which environmental agents and genetic factors influence disease risk, both as independent factors and as modifiers of each other. Such studies have led investigators to examine the role of genetic susceptibility in exposure-disease relations. Studies examining the modifying effect of genes on environmental exposures, often referred to as gene-environment interactions, are increasingly common. Such analyses can identify genetic subpopulations of persons for whom risk factors are most relevant, as well as clarify the biologic mechanisms of exposure-disease relations.
Selection bias occurs in epidemiologic studies when there are systematic differences in characteristics between persons who are selected for study and those who are not (1). Such differences can arise from the procedures used to select subjects and/or from factors that influence study participation (self-selection), the end result being that the relation between exposure and disease is different for persons who participate and persons who should theoretically be eligible for study (2). It is well known that selection bias occurs when response proportions are jointly dependent on exposure and disease status (3, 4). Selection bias is a particular problem in case-control studies, since exposure and disease have both occurred at the time study subject selection is made (5). While selection bias is a possibility even when response rates for recruitment are high for both cases and controls, it is a particular concern when response rates are low. Studies analyzing participation rates have found that responders and nonresponders differ with regard to various characteristics, such as age, employment status, and race (6). People who have the disease under study are more likely to participate in a study than nondiseased controls (7). Participation may also depend on exposures—for example, persons with exposures perceived to be socially unacceptable, such as consumption of alcoholic beverages or smoking, may be less likely to participate (8).
While selection bias is a major concern in exposure-disease associations, it can also be a concern in gene-disease associations, because genotype could be associated with exposures that influence participation. However, we show here that the assessment of gene-environment interactions will not be subject to selection bias under the likely scenario that genotype does not influence participation conditional on the exposure and disease status. This is true even when selection, including self-selection of subjects, is jointly influenced by exposure and disease and regardless of whether the genotype is related to exposure, disease, or both.
METHODS AND RESULTS
Selection bias in the exposure-by-disease 2 × 2 table
The degree of selection bias in the estimate of exposure-disease relations for a dichotomous exposure and a dichotomous disease can be expressed in terms of selection proportions. Suppose the number of people in the exposure-by-disease 2 × 2 table representing the target population of the epidemiologic study is as shown in figure 1. The target (true) exposure-disease odds ratio is ORT = AD/BC. In the observed case-control study, which is subject to selection bias due to sample selection and self-selection, the number of subjects by exposure and disease status may be expressed as in figure 1. In this case, the observed exposure-disease relation is ORO = ad/bc.
Selection proportions are the proportions of persons in the target population who participate in the epidemiologic study, and they are influenced by sampling methods and self-selection by the study subjects. For each of the disease × exposure groups, the selection proportions are as follows: α = selection of persons with the exposure and the disease (a/A); β = selection of persons with the exposure but not the disease (b/B); γ = selection of persons without the exposure but with the disease (c/C); and δ = selection of persons with neither the exposure nor the disease (d/D). The observed exposure-disease odds ratio, relative to the true odds ratio, may be expressed as ORO = ad/bc = αAδD/βBγC = (αδ/βγ) × ORT.
If the cross-product αδ/βγ = 1, then selection bias does not affect the odds ratio estimate. For example, there is no selection bias when disease influences participation, but within disease groups, the response proportions for those exposed and not exposed are the same (i.e., α = γ and β = δ). Similarly, there is no selection bias when only exposure is related to participation rates (i.e., α = β and γ = δ) or when the effects of disease and exposure on participation are independent, in the sense that each selection proportion is the product of a disease-specific factor and an exposure-specific factor. If disease and exposure status jointly affect participation rates, that is, if αδ/βγ ≠ 1, then the observed odds ratio is biased (4).
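The relation ORO = (αδ/βγ) × ORT can be checked numerically. The following is a minimal Python sketch; the function names and the target counts are illustrative assumptions and do not come from the paper.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2 x 2 table laid out as in the text:
    a = exposed cases, b = exposed noncases,
    c = unexposed cases, d = unexposed noncases."""
    return (a * d) / (b * c)


def selection_bias_factor(alpha, beta, gamma, delta):
    """Cross-product of the four selection proportions, (alpha*delta)/(beta*gamma)."""
    return (alpha * delta) / (beta * gamma)


def observed_counts(target, proportions):
    """Apply (alpha, beta, gamma, delta) to the target counts (A, B, C, D)."""
    return tuple(n * p for n, p in zip(target, proportions))


# Hypothetical target population (illustrative numbers, not from the paper).
target = (100, 200, 100, 400)      # A, B, C, D
or_target = odds_ratio(*target)

# Participation depends on disease only: alpha = gamma and beta = delta.
disease_only = (0.9, 0.5, 0.9, 0.5)
# Participation depends jointly on exposure and disease.
joint = (0.9, 0.5, 0.8, 0.6)

for label, props in [("disease only", disease_only), ("joint", joint)]:
    factor = selection_bias_factor(*props)
    or_observed = odds_ratio(*observed_counts(target, props))
    print(f"{label}: factor = {factor:.3f}, ORO = {or_observed:.3f}, ORT = {or_target:.3f}")
```

In the first scenario the factor equals 1 and the observed odds ratio matches the target; in the second the observed odds ratio equals the target multiplied by the factor.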
Selection bias in gene-environment interaction
In an analysis of gene-environment interaction, one stratifies subjects on genotype status to determine whether the exposure-disease relation differs according to genotype. Then the distributions of exposure and disease in the target population and the exposure-disease relation, within genotype, are as shown in figure 2. The true interaction odds ratio (ORINT-T), defined (in this case) as the multiplicative factor by which the mutant exposure-disease odds ratio differs from the wild-type exposure-disease odds ratio, is ORINT-T = (A2D2/B2C2)/(A1D1/B1C1). (ORINT is analogous to e^c, where c is the coefficient of the interaction term in a logistic model with terms for exposure (1 = exposed, 0 = not exposed), genotype (1 = mutant, 0 = wild-type), and their interaction.)
Similarly, in an epidemiologic study, the distributions of exposure and disease and the exposure-disease relation by genotype may be represented as in figure 2. The observed interaction odds ratio, ORINT-O, is ORINT-O = (a2d2/b2c2)/(a1d1/b1c1). The selection proportions among persons with the wild-type genotype are denoted by α1, β1, γ1, and δ1, and the selection proportions among persons with mutant genotypes are denoted by α2, β2, γ2, and δ2. Therefore, the observed interaction odds ratio can be represented as
ORINT-O = (α2A2δ2D2/β2B2γ2C2)/(α1A1δ1D1/β1B1γ1C1) = [(α2δ2/β2γ2)/(α1δ1/β1γ1)] × ORINT-T.
Under the circumstances described in the section above, each of the stratum- (genotype-) specific odds ratios would be affected by selection bias. However, if genotype does not influence response proportions conditional on exposure and disease status, that is, if α1 = α2, β1 = β2, γ1 = γ2, and δ1 = δ2, then the observed interaction odds ratio is equal to the target interaction odds ratio: ORINT-O = ORINT-T. Thus, under these conditions, the interaction odds ratio is not affected by selection bias even though the stratum-specific odds ratios are biased.
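The cancellation of the selection proportions in the interaction odds ratio can be illustrated with a short Python sketch. The counts, selection proportions, and function names below are hypothetical and chosen only for illustration.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for counts ordered (a, b, c, d) as in the text."""
    return (a * d) / (b * c)


def apply_selection(counts, proportions):
    """Multiply target counts (A, B, C, D) by (alpha, beta, gamma, delta)."""
    return tuple(n * p for n, p in zip(counts, proportions))


def interaction_or(wild_type, mutant):
    """Mutant exposure-disease odds ratio divided by the wild-type one."""
    return odds_ratio(*mutant) / odds_ratio(*wild_type)


# Hypothetical target counts (A, B, C, D) within each genotype.
wild_target = (60, 300, 40, 400)      # exposure-disease odds ratio 2.0
mutant_target = (80, 200, 30, 300)    # exposure-disease odds ratio 4.0
true_interaction = interaction_or(wild_target, mutant_target)

# Case 1: selection proportions do not depend on genotype.
shared = (0.8, 0.3, 0.9, 0.6)
same = interaction_or(apply_selection(wild_target, shared),
                      apply_selection(mutant_target, shared))

# Case 2: beta differs for the mutant genotype, violating the assumption.
mutant_props = (0.8, 0.2, 0.9, 0.6)
different = interaction_or(apply_selection(wild_target, shared),
                           apply_selection(mutant_target, mutant_props))

print(f"true interaction OR:       {true_interaction:.3f}")
print(f"genotype-free selection:   {same:.3f}")
print(f"genotype-dependent select: {different:.3f}")
```

With shared selection proportions each stratum-specific odds ratio is inflated by the same factor, so their ratio is unchanged; once one proportion differs by genotype, the ratio no longer equals the target value.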
Example: alcohol consumption, aldehyde dehydrogenase 2, and esophageal cancer risk
To illustrate this result, let us consider a hypothetical study evaluating the relation of alcohol consumption, aldehyde dehydrogenase 2 (one of the key enzymes in the oxidation of alcohol to acetate), and esophageal cancer risk. A polymorphism in the aldehyde dehydrogenase 2 gene ALDH2 caused by a structural point mutation that results in a Glu→Lys substitution is associated with phenotypic loss of enzymatic activity (9). Persons with this allele, ALDH2*2, are deficient in aldehyde dehydrogenase 2 activity and tend to refrain from excessive alcohol drinking because of their adverse reaction to alcohol (10). This polymorphism is highly prevalent in Asian populations, although it is rare in other ethnic groups (10, 11). Because this gene encodes an enzyme that is critical for the elimination of acetaldehyde (a carcinogen) generated by alcohol, the *2 allele (which is inactive) is associated with several cancers, including esophageal cancer (12). While case-only analyses of gene-environment interactions require independence between the gene and the environment, this assumption is not required in order for case-control analyses to yield unbiased results (13).
Let us assume that our target population is selected to have a high prevalence of this polymorphism (i.e., an Asian population) and that there are 1,000 cases of esophageal cancer and 1,000,000 cancer-free individuals. Figure 3 shows the distributions of exposure, genotype, and disease in this hypothetical target population. Assume that among cases, 73 percent have high alcohol consumption, while among controls, 50 percent have high alcohol consumption, such that the true exposure-disease odds ratio is 2.75. In this population, the frequency of homozygosity for the *2 allele (called “mutant” in this example) is assumed to be 50 percent, while persons with the other genotypes (the homozygous wild-types and heterozygotes, called “wild-type”) comprise the remaining 50 percent of the population. Among the cases, 75 percent have the mutant genotype and 25 percent have the wild-type genotype. Thus, the true gene-disease odds ratio in this population is 3.0. Finally, among persons with the mutant genotype, the proportion with high alcohol consumption is 40 percent, while the proportion with high alcohol consumption among those with the wild-type genotype is 60 percent. Therefore, the exposure-genotype odds ratio is 0.45. Among persons with the mutant genotype, the true exposure-disease odds ratio is 4.0, while among those with the wild-type genotype, the exposure-disease odds ratio is 2.0; this results in an interaction odds ratio of 2.0.
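The quantities in this paragraph can be approximately reconstructed from the stated percentages. The sketch below is not a reproduction of figure 3: it assumes that the genotype and alcohol percentages apply to the 1,000,000 cancer-free individuals and that the cases within each genotype are split so that the stated genotype-specific odds ratios hold. The helper names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio, (a*d)/(b*c)."""
    return (a * d) / (b * c)


def split_cases(n_cases, exposed_noncases, unexposed_noncases, stratum_or):
    """Split n_cases into (exposed, unexposed) so that the exposure-disease
    odds ratio within the stratum equals stratum_or."""
    case_odds = stratum_or * exposed_noncases / unexposed_noncases
    exposed = n_cases * case_odds / (1 + case_odds)
    return exposed, n_cases - exposed


N_CASES, N_NONCASES = 1_000, 1_000_000

# Noncases: 50% mutant; 40% of mutants and 60% of wild-types drink heavily.
mut_B, mut_D = 0.5 * N_NONCASES * 0.4, 0.5 * N_NONCASES * 0.6
wt_B, wt_D = 0.5 * N_NONCASES * 0.6, 0.5 * N_NONCASES * 0.4

# Cases: 75% mutant, 25% wild-type; stratum odds ratios 4.0 and 2.0.
mut_A, mut_C = split_cases(0.75 * N_CASES, mut_B, mut_D, 4.0)
wt_A, wt_C = split_cases(0.25 * N_CASES, wt_B, wt_D, 2.0)

crude_or = odds_ratio(mut_A + wt_A, mut_B + wt_B, mut_C + wt_C, mut_D + wt_D)
gene_disease_or = odds_ratio(mut_A + mut_C, mut_B + mut_D, wt_A + wt_C, wt_B + wt_D)
# Exposure-genotype odds ratio in the whole population (cases plus noncases).
exposure_genotype_or = odds_ratio(mut_A + mut_B, mut_C + mut_D, wt_A + wt_B, wt_C + wt_D)
interaction = odds_ratio(mut_A, mut_B, mut_C, mut_D) / odds_ratio(wt_A, wt_B, wt_C, wt_D)

print(f"exposed share of cases:  {(mut_A + wt_A) / N_CASES:.3f}")
print(f"crude exposure-disease:  {crude_or:.3f}")
print(f"gene-disease:            {gene_disease_or:.3f}")
print(f"exposure-genotype:       {exposure_genotype_or:.3f}")
print(f"interaction:             {interaction:.3f}")
```

Under these assumptions about 73 percent of cases are exposed, and the computed odds ratios agree with the values quoted above to within rounding.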
In our hypothetical observational epidemiologic study, 100 percent of eligible cases and 0.1 percent of eligible controls are selected to participate. However, because of self-selection in our case population, only 85 percent of cases with high alcohol consumption agree to participate, while 90 percent of cases with low alcohol consumption agree to participate. Among controls, only 50 percent of those with high alcohol consumption agree to participate, while 90 percent of those with low alcohol consumption agree to participate. Thus, the selection probabilities are α = 0.85, β = 0.0005, γ = 0.90, and δ = 0.0009 and are independent of genotype within each exposure/disease stratum. The numbers of participants in each disease and exposure stratum who enter the epidemiologic study are shown in figure 3. Since the cross-product of the selection probabilities is 1.7, the observed odds ratio is biased by a factor of 1.7, resulting in an observed odds ratio of approximately 4.7 (2.75 × 1.7 ≈ 4.7). Case status and exposure status jointly affect participation in this example; consequently, our main exposure-disease association is biased.
Similarly, the gene-disease odds ratio and the gene-exposure odds ratio are biased, because genotype is related to exposure and disease, which influence the selection proportions (to compute this, one must use the stratum-specific numbers at the bottom of figure 3). However, because participation is not affected by genotype itself but is only affected through its association with exposure and disease, stratifying on genotype and taking the ratio of the genotype-specific exposure-disease odds ratios will not produce a biased estimate of the interaction odds ratio. This example demonstrates that even when there is a gene-exposure relation and a gene-disease relation, and exposure and disease jointly affect selection probabilities, selection bias in case-control studies does not bias gene-environment interaction estimates.
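The behavior described in this example can be sketched in Python. The target counts below are reconstructed from the percentages stated in the example (an assumption; they are not copied from figure 3), and the selection proportions are those given in the text. Helper names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio, (a*d)/(b*c)."""
    return (a * d) / (b * c)


def split_cases(n_cases, exposed_noncases, unexposed_noncases, stratum_or):
    """Split cases so the stratum exposure-disease odds ratio equals stratum_or."""
    case_odds = stratum_or * exposed_noncases / unexposed_noncases
    exposed = n_cases * case_odds / (1 + case_odds)
    return exposed, n_cases - exposed


# Reconstructed target counts (A, B, C, D) per genotype; an assumption.
wt_A, wt_C = split_cases(250, 300_000, 200_000, 2.0)
mut_A, mut_C = split_cases(750, 200_000, 300_000, 4.0)
target = {"wild-type": (wt_A, 300_000, wt_C, 200_000),
          "mutant": (mut_A, 200_000, mut_C, 300_000)}

# alpha, beta, gamma, delta from the text; identical for both genotypes.
selection = (0.85, 0.0005, 0.90, 0.0009)
observed = {g: tuple(n * p for n, p in zip(counts, selection))
            for g, counts in target.items()}


def crude_or(tables):
    """Exposure-disease odds ratio after pooling the genotype strata."""
    return odds_ratio(*(sum(cells) for cells in zip(*tables.values())))


def gene_disease_or(tables):
    """Odds ratio relating mutant genotype to disease, pooling over exposure."""
    mut, wt = tables["mutant"], tables["wild-type"]
    return odds_ratio(mut[0] + mut[2], mut[1] + mut[3], wt[0] + wt[2], wt[1] + wt[3])


def interaction_or(tables):
    """Ratio of the mutant to the wild-type exposure-disease odds ratio."""
    return odds_ratio(*tables["mutant"]) / odds_ratio(*tables["wild-type"])


for name, fn in [("crude exposure-disease", crude_or),
                 ("gene-disease", gene_disease_or),
                 ("interaction", interaction_or)]:
    print(f"{name}: target = {fn(target):.3f}, observed = {fn(observed):.3f}")
```

In this reconstruction the crude and genotype-specific exposure-disease odds ratios are each inflated by the factor 1.7 and the gene-disease odds ratio shifts away from 3.0, while the interaction odds ratio is the same in the target and observed tables.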
DISCUSSION
The possibility of introducing selection bias when conducting epidemiologic studies, particularly case-control studies, is a major concern. Reasons for participating in a study differ between cases and controls, and the decision as to whether to participate may be influenced by a person’s exposure history as well as his or her disease status. Studies examining genetic exposures may involve an additional layer of complexity, since willingness by participants to provide a biologic specimen may be affected by many of the same, or different, factors (14).
Methods for correcting the effects of selection bias on estimates of the exposure-disease relation are limited. First, if one has information on the four (exposure-by-disease) selection proportions, one can adjust the observed odds ratio to correct for selection bias (4). However, selection proportions are rarely known, because estimating them requires information on the exposure-by-disease distribution of the study nonrespondents. Second, selection bias arising from selection factors other than the main exposure of interest can be corrected by adjusting for those factors in the analysis, as one would for confounding factors (2), but this technique cannot be applied when the exposure of interest is itself a selection factor.
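The first approach amounts to dividing the observed odds ratio by the cross-product of the selection proportions. The following is a minimal Python sketch, assuming the four proportions are known exactly; the observed counts are hypothetical rounded values and the function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio, (a*d)/(b*c)."""
    return (a * d) / (b * c)


def corrected_odds_ratio(observed, proportions):
    """Correct an observed odds ratio for selection bias.

    observed    -- observed counts (a, b, c, d)
    proportions -- known selection proportions (alpha, beta, gamma, delta)

    Each observed count is divided by its selection proportion to estimate
    the target count, which is equivalent to dividing the observed odds
    ratio by (alpha*delta)/(beta*gamma).
    """
    estimated_target = tuple(n / p for n, p in zip(observed, proportions))
    return odds_ratio(*estimated_target)


# Hypothetical observed counts (a, b, c, d) and known selection proportions.
observed = (623, 250, 240, 450)
proportions = (0.85, 0.0005, 0.90, 0.0009)

or_observed = odds_ratio(*observed)
or_corrected = corrected_odds_ratio(observed, proportions)
print(f"observed OR = {or_observed:.3f}, corrected OR = {or_corrected:.3f}")
```

In practice the selection proportions would themselves be estimates, so the corrected value would carry additional uncertainty that this sketch does not address.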
Therefore, one usually cannot correct for selection bias if the exposure of interest influences selection (jointly with disease), and this may bias the existence, strength, and direction of main associations between exposures and disease. However, we have shown that the assessment of gene-environment interaction odds ratios in epidemiologic studies is not affected by selection bias when the genotype does not influence selection conditional on exposure and disease status. Wacholder et al. (15) recently presented a related result for selection bias in gene-environment interactions in case-control studies using hospital controls. Their results apply to the more limited situation in which the only sources of selection bias are the risk factors for the control disease (i.e., the controls do not have the same gene-exposure distribution as the ideal “target” controls, and this leads to the selection bias). Wacholder et al. concluded that there is no bias in the estimation of gene-environment interactions for the disease of interest when there is no gene-environment interaction for the control disease, even when the control condition is caused by the genetic or environmental factor.
The main assumption for our results, that genotype does not influence selection conditional on exposure and disease, seems likely to be true in most situations. Specifically, it seems reasonable to assume that one’s genotype cannot influence participation in a study, other than through some association of genotype with phenotype or with behavior. Genotype could, of course, influence participation through the relation of genotype with selection factors other than the exposure of interest, in which case the gene-environment interaction estimate will be biased as well. As an illustration of this, let us assume in the above example that the ALDH2*2 allele is more common among certain Asian ethnic groups and that persons from those groups are less likely to participate as controls than persons from other groups. Then the gene-environment interaction estimate will be biased. This is because the genotype affects participation rates, not directly but through its association with another selection factor. However, as we noted above, the selection bias contributed by a selection factor (in this case, ethnic group) other than the main exposure can be corrected by controlling for the selection factor in the statistical analysis (2). Thus, the effect of this source of selection bias can be corrected using standard adjustment methods.
In summary, studies in which high nonparticipation rates raise concerns about selection bias are still able to generate valid estimates of gene-environment interactions. Identification of such interactions can help to define and clarify biologic pathways between exposure and disease in certain subsets of people.
This research was supported by National Institutes of Health training grant 5-T32-CA09168.
Correspondence to Emily White, Cancer Prevention Research Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, MP-702, P.O. Box 19024, Seattle, WA 98109 (e-mail: firstname.lastname@example.org).