-
PDF
- Split View
-
Views
-
Cite
Cite
Komla M Gnona, William C L Stewart, Revisiting the Wald Test in Small Case-Control Studies With a Skewed Covariate, American Journal of Epidemiology, Volume 191, Issue 8, August 2022, Pages 1508–1518, https://doi.org/10.1093/aje/kwac058
- Share Icon Share
Abstract
The Wald test is routinely used in case-control studies to test for association between a covariate and disease. However, when the evidence for association is high, the Wald test tends to inflate small P values as a result of the Hauck-Donner effect (HDE). Here, we investigate the HDE in the context of genetic burden, both with and without additional covariates. First, we examine the burden-based P values in the absence of association using whole-exome sequence data from 1000 Genomes Project reference samples (n = 54) and selected preterm infants with neonatal complications (n = 74). Our careful analysis of the burden-based P values shows that the HDE is present and that the cause of the HDE in this setting is likely a natural extension of the well-known cause of the HDE in 2 × 2 contingency tables. Second, in a reanalysis of real data, we find that the permutation test provides increased power over the Wald, Firth, and likelihood ratio tests, which agrees with our intuition since the permutation test is valid for any sample size and since it does not suffer from the HDE. Therefore, we propose a powerful and computationally efficient permutation-based approach for the analysis and reanalysis of small case-control association studies.
Abbreviation
- 1KG
1000 Genomes Project
- HDE
Hauck-Donner effect
- IF
inflation factor
- LRT
likelihood ratio test
- NC
neonatal complications
- SNP
single nucleotide polymorphism
INTRODUCTION
A case-control study is a very common statistical design for testing covariates for association with disease (1, 2). This design often uses standard techniques, like the Wald test, to assess the evidence for association. For case-control studies, the Wald test statistic is the estimated log odds ratio divided by an estimate of its standard deviation. It is attractive because there are explicit formulas for computing both the numerator and the denominator of the test statistic, which reduces the computation time and simplifies the interpretation of statistically significant results (3, 4). However, in small case-control studies, the Wald test inflates small P values (or alternatively decreases −log10P value) (5–7) because the numerator of the Wald statistic is scaled by a biased estimator of its standard deviation. Here, we explore this phenomenon in greater detail in the context of case-control studies with possibly skewed covariates (e.g., genetic burden).
Under the null hypothesis of no association, the square of the Wald statistic (see equation 1) follows a χ2 distribution with 1 degree of freedom. This implies that for any table of observed counts, we can compute the −log10P value and an estimate of the effect size. Therefore, we can express the −log10P value as a function of the effect size (see Figure 1). As the effect size increases, the −log10P values from both the Wald test and the likelihood ratio test (LRT) increase. However, when the effect size becomes large (e.g., when the log odds ratio exceeds 2 for 2 × 2 contingency tables), the −log10P values begin to diverge. Specifically, the −log10P values from the LRT continue to increase, but the −log10P values from the Wald test decrease dramatically. This aberrant behavior of the Wald test at large effect sizes is known as the HDE (see Yee (10)). Moreover, because small case-control studies often require large effect sizes to reach statistical significance (especially when multiple tests are involved), such studies could be substantially underpowered due to the HDE.
A Standard 2 × 2 Contingency Table That Summarizes the Relationship Between a Risk Factor and the Suspected Outcomea
Suspected or Risk Factor . | Cases . | Controls . |
---|---|---|
Presence | a | c |
Absence | b | d |
Suspected or Risk Factor . | Cases . | Controls . |
---|---|---|
Presence | a | c |
Absence | b | d |
aa, b, c, and d are counts, and “Presence” and “Absence” refer to the presence and absence of a genetic variant, respectively.
A Standard 2 × 2 Contingency Table That Summarizes the Relationship Between a Risk Factor and the Suspected Outcomea
Suspected or Risk Factor . | Cases . | Controls . |
---|---|---|
Presence | a | c |
Absence | b | d |
Suspected or Risk Factor . | Cases . | Controls . |
---|---|---|
Presence | a | c |
Absence | b | d |
aa, b, c, and d are counts, and “Presence” and “Absence” refer to the presence and absence of a genetic variant, respectively.

The −log10P values (y-axis) are shown for the Wald test (solid line) across all possible 2 × 2 contingency tables for different sample size. Note that, while the x-axis is the same for all 3 plots, the range of the y-axis varies for each plot. Because the Wald test can also be computed by logistic regression, it is possible to compute the likelihood ratio test (LRT) from logistic regression as well. Hence, in this simulation, we show the −log10P values from the LRT for comparison (dashed line). Note that the contingency tables are ordered, in that tables with “smaller” effects are near 0, and tables with “larger” effects are near −1/2 and 1/2, respectively. Specifically, we compute the effect size (x-axis) as b/(b + d) − a/(a + c), where a/(a + c) = 1/2 (see Table 1).
Most practitioners are unaware of the danger that the HDE poses to small case-control studies for several reasons. First, the HDE is rarely mentioned in conjunction with the Wald test, and the Wald test is routinely performed (by default) in most generalized linear models, especially when the data contain multiple covariates. Second, the HDE (when present) is easily missed by traditional summary plots (e.g., Manhattan plots) that tend to emphasize the magnitude of the smallest P values and not the distribution of all P values. Third, there are very few examples in the literature of the negative impact of the HDE on case-control studies because: 1) small studies are often underpowered (as mentioned above), and 2) for published studies, researchers either do not know that the Wald test should be avoided, or they elect to use tests that do not exhibit the HDE (e.g., likelihood ratio tests and score tests).
Here, in the context of logistic regression and the Wald test, we seek to further investigate (and raise awareness of) the negative effect of the HDE in small case-control studies involving possibly skewed covariates (e.g., genetic burden). It is important to test covariates that summarize genetic variation across multiple sites (e.g., genetic burden) within a gene, pathway, or predefined region because single nucleotide polymorphisms (SNPs) only summarize genetic variation at a single site (11, 12) and explain only a small fraction of the heritability of common, complex traits (13, 14). We also provide a heuristic argument for the cause of the HDE in genetic burden association studies. Last, in a reanalysis of real data, we illustrate the potential benefits of avoiding: 1) the negative impact of the HDE, and 2) the incorrect use of large-sample distributional assumptions. In particular, the permutation test provides increased power over the Wald and likelihood ratio tests.
DATA DESCRIPTION
We reanalyzed a neonatal complications (NC) study with data for 131 preterm White infants recruited through the Perinatal Research Repository at Nationwide Children’s Hospital and born between 27 and 31 completed weeks of gestation (15). Of these 131 infants, 75 suffered from NC (e.g., bronchopulmonary dysplasia, retinopathy of prematurity, intraventricular hemorrhage, and necrotizing enterocolitis), while 56 infants did not. Moreover, the gestational age of each infant was observed, and the exome of each infant was sequenced across 273,168 variable sites.
Furthermore, to study the Wald test and the HDE in the absence of association, we constructed 2 homogeneous data sets from existing whole-exome sequencing data. The first data set consists of 54 reference samples of European ancestry randomly chosen from the 1000 Genomes phase 3 data release (16). For each sample we extracted sequence data at the same 273,168 variable sites assayed in the NC study. Then, to mimic a case-control study design, we labeled 27 of the 54 samples as cases (i.e., “pseudo-cases”). The remaining 27 samples were labeled as controls (i.e., “pseudo-controls”). As such, we refer to this data set as the 1000 Genomes (1KG) pseudo-null. The second homogeneous data set consists of whole-exome sequencing data from 74 of the 75 preterm infants in the aforementioned NC study (see above). One infant was dropped to keep the study balanced. We randomly labeled 37 individuals as pseudo-cases and labeled the remaining 37 individuals as pseudo-controls. Hereafter, we refer to this data set as the NC pseudo-null. Because the labels for both 1KG pseudo-null and NC pseudo-null were assigned randomly, pseudo-case and pseudo-control status is independent of the sequence data in these data sets. Hence, these data follow the null hypothesis of no association.
METHODS
The Wald statistic (denoted W) is |${\hat{\beta}}_{bur}/ se({\hat{\beta}}_{bur})$| where |${\hat{\beta}}_{bur}$| is the maximum likelihood estimate of |${\beta}_{bur}$| and se(|${\hat{\beta}}_{bur}$|) denotes standard error of |${\hat{\beta}}_{bur}$|.
Test Level, % . | 1KG Pseudo-Null (n = 54) . | NC Pseudo-Null (n = 74) . | ||||
---|---|---|---|---|---|---|
Wald . | LRT . | Firth . | Wald . | LRT . | Firth . | |
1.0 | 0.0039 | 0.0114 | 0.0084 | 0.0044 | 0.0108 | 0.0078 |
2.5 | 0.0182 | 0.0294 | 0.0236 | 0.0154 | 0.0284 | 0.0219 |
5.0 | 0.0403 | 0.0547 | 0.0455 | 0.0406 | 0.0562 | 0.0471 |
10.0 | 0.0951 | 0.1114 | 0.0980 | 0.0913 | 0.1092 | 0.0950 |
Test Level, % . | 1KG Pseudo-Null (n = 54) . | NC Pseudo-Null (n = 74) . | ||||
---|---|---|---|---|---|---|
Wald . | LRT . | Firth . | Wald . | LRT . | Firth . | |
1.0 | 0.0039 | 0.0114 | 0.0084 | 0.0044 | 0.0108 | 0.0078 |
2.5 | 0.0182 | 0.0294 | 0.0236 | 0.0154 | 0.0284 | 0.0219 |
5.0 | 0.0403 | 0.0547 | 0.0455 | 0.0406 | 0.0562 | 0.0471 |
10.0 | 0.0951 | 0.1114 | 0.0980 | 0.0913 | 0.1092 | 0.0950 |
Abbreviations: 1KG, 1000 Genomes Project; LRT, likelihood ratio test; NC, neonatal complications.
a Type I error approximated for the 1KG (16) and the NC (15) pseudo-null data sets on the basis of P values computed across 20,000 genes using the Wald test, the LRT, and the Firth method. Here, minor allele counts were summed across variants without using weights. Note that the type I errors of the Wald test measured at 1%, 2.5%, and 5% are well below the expected levels, whereas the type I errors of the LRT and Firth method are more consistent with the expectations.
Test Level, % . | 1KG Pseudo-Null (n = 54) . | NC Pseudo-Null (n = 74) . | ||||
---|---|---|---|---|---|---|
Wald . | LRT . | Firth . | Wald . | LRT . | Firth . | |
1.0 | 0.0039 | 0.0114 | 0.0084 | 0.0044 | 0.0108 | 0.0078 |
2.5 | 0.0182 | 0.0294 | 0.0236 | 0.0154 | 0.0284 | 0.0219 |
5.0 | 0.0403 | 0.0547 | 0.0455 | 0.0406 | 0.0562 | 0.0471 |
10.0 | 0.0951 | 0.1114 | 0.0980 | 0.0913 | 0.1092 | 0.0950 |
Test Level, % . | 1KG Pseudo-Null (n = 54) . | NC Pseudo-Null (n = 74) . | ||||
---|---|---|---|---|---|---|
Wald . | LRT . | Firth . | Wald . | LRT . | Firth . | |
1.0 | 0.0039 | 0.0114 | 0.0084 | 0.0044 | 0.0108 | 0.0078 |
2.5 | 0.0182 | 0.0294 | 0.0236 | 0.0154 | 0.0284 | 0.0219 |
5.0 | 0.0403 | 0.0547 | 0.0455 | 0.0406 | 0.0562 | 0.0471 |
10.0 | 0.0951 | 0.1114 | 0.0980 | 0.0913 | 0.1092 | 0.0950 |
Abbreviations: 1KG, 1000 Genomes Project; LRT, likelihood ratio test; NC, neonatal complications.
a Type I error approximated for the 1KG (16) and the NC (15) pseudo-null data sets on the basis of P values computed across 20,000 genes using the Wald test, the LRT, and the Firth method. Here, minor allele counts were summed across variants without using weights. Note that the type I errors of the Wald test measured at 1%, 2.5%, and 5% are well below the expected levels, whereas the type I errors of the LRT and Firth method are more consistent with the expectations.
Approximate Type I Error for Wald Test and Likelihood Ratio Test Using Weightsa
Test Level, % . | 1KG Pseudo-Null (n = 54) . | NC Pseudo-Null (n = 74) . | ||||
---|---|---|---|---|---|---|
Wald . | LRT . | Firth . | Wald . | LRT . | Firth . | |
1.0 | 0.0043 | 0.0119 | 0.0094 | 0.0039 | 0.0129 | 0.0087 |
2.5 | 0.0160 | 0.0291 | 0.0230 | 0.0118 | 0.0290 | 0.0183 |
5.0 | 0.0394 | 0.0555 | 0.0457 | 0.0321 | 0.0601 | 0.0429 |
10.0 | 0.0912 | 0.1094 | 0.0956 | 0.0809 | 0.1158 | 0.0877 |
Test Level, % . | 1KG Pseudo-Null (n = 54) . | NC Pseudo-Null (n = 74) . | ||||
---|---|---|---|---|---|---|
Wald . | LRT . | Firth . | Wald . | LRT . | Firth . | |
1.0 | 0.0043 | 0.0119 | 0.0094 | 0.0039 | 0.0129 | 0.0087 |
2.5 | 0.0160 | 0.0291 | 0.0230 | 0.0118 | 0.0290 | 0.0183 |
5.0 | 0.0394 | 0.0555 | 0.0457 | 0.0321 | 0.0601 | 0.0429 |
10.0 | 0.0912 | 0.1094 | 0.0956 | 0.0809 | 0.1158 | 0.0877 |
Abbreviations: 1KG, 1000 Genomes; LRT, likelihood ratio test; NC, neonatal complications.
a Type I error approximated for the 1KG (16) and the NC (15) pseudo-null data sets on the basis of P values computed across 20,000 genes using the Wald test, the LRT, and the Firth method. Here, variants are weighted inversely proportional to their allele frequencies (See equation 2). Note that the type I errors of the Wald test measured at 1% and 2.5%, 5% are well below the expected |$\alpha$|levels, whereas the type I errors of the LRT and Firth are more consistent with the expectations.
Approximate Type I Error for Wald Test and Likelihood Ratio Test Using Weightsa
Test Level, % . | 1KG Pseudo-Null (n = 54) . | NC Pseudo-Null (n = 74) . | ||||
---|---|---|---|---|---|---|
Wald . | LRT . | Firth . | Wald . | LRT . | Firth . | |
1.0 | 0.0043 | 0.0119 | 0.0094 | 0.0039 | 0.0129 | 0.0087 |
2.5 | 0.0160 | 0.0291 | 0.0230 | 0.0118 | 0.0290 | 0.0183 |
5.0 | 0.0394 | 0.0555 | 0.0457 | 0.0321 | 0.0601 | 0.0429 |
10.0 | 0.0912 | 0.1094 | 0.0956 | 0.0809 | 0.1158 | 0.0877 |
Test Level, % . | 1KG Pseudo-Null (n = 54) . | NC Pseudo-Null (n = 74) . | ||||
---|---|---|---|---|---|---|
Wald . | LRT . | Firth . | Wald . | LRT . | Firth . | |
1.0 | 0.0043 | 0.0119 | 0.0094 | 0.0039 | 0.0129 | 0.0087 |
2.5 | 0.0160 | 0.0291 | 0.0230 | 0.0118 | 0.0290 | 0.0183 |
5.0 | 0.0394 | 0.0555 | 0.0457 | 0.0321 | 0.0601 | 0.0429 |
10.0 | 0.0912 | 0.1094 | 0.0956 | 0.0809 | 0.1158 | 0.0877 |
Abbreviations: 1KG, 1000 Genomes; LRT, likelihood ratio test; NC, neonatal complications.
a Type I error approximated for the 1KG (16) and the NC (15) pseudo-null data sets on the basis of P values computed across 20,000 genes using the Wald test, the LRT, and the Firth method. Here, variants are weighted inversely proportional to their allele frequencies (See equation 2). Note that the type I errors of the Wald test measured at 1% and 2.5%, 5% are well below the expected |$\alpha$|levels, whereas the type I errors of the LRT and Firth are more consistent with the expectations.
For large samples, the Wald test statistic (denoted W2) follows a χ2 distribution with 1 degree of freedom under the null hypothesis of no association. Similarly, the LRT statistic is 2[log L (|${\hat{\beta}}_{int},{\hat{\beta}}_{bur}$|) – log L (|${\hat{\beta}}_{int}$|)], and it also follows a χ2 distribution with 1 degree of freedom under the null hypothesis of no association (for large samples). We also considered the Firth test (21, 22), which penalizes the likelihood function to reduce the bias in the estimate for small samples. The Firth test is implemented in the R (R Foundation for Statistical Computing, Vienna, Austria) package “logistf” (23, 24).
To perform the permutation test, we permuted case-control labels more than 100,000,000 times per gene. Then, |${\hat{\beta}}_{bur}$| (i.e., the observed burden coefficient) was compared with its permutation distribution to compute the permutation-based P value. For each gene, we used the permutation-based P value as the gold standard because the permutation-based P value is valid for small sample sizes (25–27) and because it does not suffer from the HDE. Specifically, for each gene, we measured the HDE by computing the ratio of the P value from the Wald and the P value from the permutation test. We refer to this quantity as the gene-specific inflation factor (IF). Then, in the absence of association (i.e., for the 1KG pseudo-null and the NC pseudo-null), we used the permutation-based P values to estimate the approximate type 1 error of the Wald, Firth, and likelihood ratio tests. Last, in the reanalysis of real data, we used the permutation-based P values to illustrate differences in relative power between all 4 tests: permutation, LRT, Firth, and Wald.

Examples of gene-specific distributions of burden from the 1000 Genomes Project, phase 3 data release (16), are shown. These genes were selected on the basis of their inflation factor (IF). IF is the ratio of the Wald-based P values divided by the permutation-based P values. The burden distributions for genes with high IFs (A and B) are typically characterized by a pronounced skew and by the presence of private levels. By contrast, genes with low IFs (C and D) show much less skew and lack of private levels. BEST2, bestrophin-2; SMIM12, small integral membrane protein 12; WDR85, WD repeat-containing protein 85.

Examples of gene-specific distributions of burden from the neonatal complications (15) pseudo-null data set are shown. These genes were selected on the basis of their inflation factor (IFs). IF is the ratio of the Wald-based P values divided by the permutation-based P values. The burden distributions for genes with high IFs (A and B) are typically characterized by a pronounced skew and by the presence of private levels. By contrast, genes with low IFs (C and D) show much less skew, and lack of private levels. ITPRIP, inositol 1,4,5-trisphosphate receptor interacting protein; LINC00544, long intergenic non--protein coding RNA 544; MIR191, microRNA 191; SLC29A2, solute carrier family 29 member 2.
RESULTS
For ease of exposition, we have organized Results into 3 subsections. The first subsection shows that, for small samples, the HDE exists when testing burden for association. In the second subsection, we provide a heuristic argument for what may be causing the HDE in the first subsection, and we provide evidence that is consistent with our claim. The third subsection illustrates the practical utility of circumventing the HDE and large sample approximations when testing a covariate for association in small samples, irrespective of whether the sample size is limited by budget constraints, and/or by the rarity of the total population size from which the sample is drawn.
Wald test suffers from the HDE
The approximate type 1 errors for Wald, Firth, and likelihood ratio tests at the 1%, 2.5%, 5%, and 10% α-levels of significance are shown in Table 2 (with equal weights) and Table 3 (with unequal weights). Because the HDE manifests as an inflation of small P values, our results show that the Wald test is not immune to HDE when testing burden for association in the publicly available reference samples and in our selected sample of preterm infants. By contrast, the Firth and likelihood ratio tests are immune to HDE. Note that the HDE is most apparent at the 1%, 2.5%, and 5% levels of significance, for both data sets and irrespective of whether the analysis was performed with equal or unequal weights (Tables 2 and 3).
Genetic burden, private levels, and the HDE
To understand the impact of the HDE at the single-gene level when testing burden for association, we examined the burden distributions of 8 genes with permutation test P values less than 0.005. Each gene was selected on the basis of its IF (inflation factor). Specifically, we chose 4 genes from each pseudo-null data set (Figures 2 and 3), and within each set of 4 genes, 2 had relatively large IFs (i.e., they suffered greatly from the HDE), while 2 had IFs close to one (i.e., they did not appear to suffer from the HDE).
From the 1KG pseudo-null data set we selected 4 genes with IFs 14, 12, 2, and 2 respectively (see Figure 2). The burden distributions of all 4 genes show considerable evidence of skew, but only the genes with high IFs (see Figure 2A–B) have an abundance of private levels of burden, where a private level indicates a burden level that is observed in one group but not the other. For example, in Figure 2B, burden levels 4, 5, 6,11, 22, 23 are private to pseudo-cases, whereas the burden levels 0, 1, 2, and 3 are not private (i.e., common to both pseudo-cases and pseudo-controls). Similarly, Figure 2A also exhibits private levels of burden at 20, 22, 23, 24, 25, and 40 in pseudo-cases and at 0 and 6 in pseudo-controls. However, by contrast, for genes with low IFs (Figure 2C–D), their burden distributions lack of private levels, and the genetic burden levels are matched across pseudo-cases and pseudo-controls.
Similar patterns of the presence/absence (and distribution) of private levels can be seen in the 4 genes selected from the NC pseudo-null (see Figure 3). Particularly, burden values of 1, 5, and 9 are private to pseudo-controls in Figure 3B and burden values of 3, 9, 10, and 11 are private to pseudo-cases in Figure 3A. The latter 2 cases have resulted in high inflation factors indicating the presence of HDE, whereas the private levels are evenly distributed across pseudo-cases and pseudo-controls at low inflation factors (Figure 3C–D). Furthermore, we also encountered cases where the IF is high (IF = 23, IF = 21), but the evidence for skew is low (Figure 4A–B). Here, the burden variable is equivalent to the customary minor allele count variable (i.e., the “dosage”) used in SNP-based association studies. For example, in Figure 4A there are exactly 3 levels of observed burden (0, 1, and 2) for pseudo-cases, and 2 levels of observed burden for pseudo-controls (0 and 1). This result, and the work of others (see Figure 5A, Ma et al. (7)), shows that SNP-based association studies are not necessarily immune to the HDE.

Burden distribution for 2 genes: Mir_320 (A) and TRNA (B), using data from a study of neonatal complications (15). The distributions of burden at these genes show little evidence for skew, but the Hauck-Donner effect is still present (inflation factors (IFs) of 23 and 21, respectively). Moreover, the distributions of burden shown are reminiscent of the distributions of minor allele counts (i.e., “dosage”) routinely seen in single nucleotide polymorphism–based association studies.
Given the observed differences in burden distributions between genes with high and low IFs (Figures 2 and 3), our results suggest that the strength of the HDE is associated with the existence and distribution of private levels in burden. However, because association does not imply causation, we give (immediately below) a simple heuristic argument that directly ties the existence and distribution of private levels of burden to the strength of the HDE.
Reanalysis of an existing NC study
We reanalyzed the whole-exome sequence data of an existing neonatal complications (NC) study (15) containing 131 preterm White infants (see Data Description, above, for details). Specifically, we computed burden-based P values using the Wald, Firth, likelihood ratio, and permutation tests. For all 4 tests, gestational age was included as an additional covariate.
From Figure 5A, we see that the permutation test identified 10 significant genes (listed in Table 4). However, from Figure 5B–D, we see that no significant genes were found by the likelihood ratio, Firth, or Wald tests. To a large extent, these implied differences in power are expected because the Wald test suffers from both the HDE and the inappropriate use of large sample approximations, whereas the LRT suffers only from the inappropriate use of large sample approximations. Fortunately, the permutation test is immune to the HDE and valid for small samples as well.

Four Manhattan plots showing gene-specific P values on odd (gray) and even (black) numbered chromosomes, using data from a study of neonatal complications (15). P values were computed using the permutation test (A), likelihood ratio test (B), the Firth method (C), and Wald test (D). Only the permutation test (A) identified genes with −log10P values above the exome-wide threshold for statistical significance (dashed line).
Gene . | Permutation Test . | LRT . | Firth . | Wald Test . |
---|---|---|---|---|
RANBP2 | 5.0 × 10–7 | 3.6 × 10–5 | 5.4 × 10–5 | 2.1 × 10–4 |
AK316092 | 1.0 × 10–6 | 1.4 × 10–5 | 6.6 × 10–5 | 9.7 × 10–3 |
GPR26 | 1.0 × 10–6 | 1.8 × 10–5 | 2.9 × 10–5 | 2.1 × 10–4 |
CCDC138 | 3.5 × 10–6 | 1.0 × 10–4 | 1.4 × 10–4 | 4.2 × 10–4 |
HIF3A | 3.5 × 10–6 | 1.2 × 10–3 | 1.7 × 10–3 | 5.5 × 10–3 |
GUCY1A3 | 5.5 × 10–6 | 1.2 × 10–4 | 2 × 10–4 | 6.5 × 10–4 |
CLDN15 | 5.5 × 10–6 | 4.5 × 10–4 | 8.4 × 10–4 | 4.1 × 10–3 |
SPINK1 | 7.0 × 10–6 | 1.3 × 10–4 | 2.2 × 10–4 | 9.6 × 10–4 |
DQ579288 | 1.1 × 10–5 | 4.0 × 10–4 | 6 × 10–4 | 1.5 × 10–3 |
MIR548A3 | 1.2 × 10–5 | 2.3 × 10–4 | 5.4 × 10–5 | 2.1 × 10–3 |
Gene . | Permutation Test . | LRT . | Firth . | Wald Test . |
---|---|---|---|---|
RANBP2 | 5.0 × 10–7 | 3.6 × 10–5 | 5.4 × 10–5 | 2.1 × 10–4 |
AK316092 | 1.0 × 10–6 | 1.4 × 10–5 | 6.6 × 10–5 | 9.7 × 10–3 |
GPR26 | 1.0 × 10–6 | 1.8 × 10–5 | 2.9 × 10–5 | 2.1 × 10–4 |
CCDC138 | 3.5 × 10–6 | 1.0 × 10–4 | 1.4 × 10–4 | 4.2 × 10–4 |
HIF3A | 3.5 × 10–6 | 1.2 × 10–3 | 1.7 × 10–3 | 5.5 × 10–3 |
GUCY1A3 | 5.5 × 10–6 | 1.2 × 10–4 | 2 × 10–4 | 6.5 × 10–4 |
CLDN15 | 5.5 × 10–6 | 4.5 × 10–4 | 8.4 × 10–4 | 4.1 × 10–3 |
SPINK1 | 7.0 × 10–6 | 1.3 × 10–4 | 2.2 × 10–4 | 9.6 × 10–4 |
DQ579288 | 1.1 × 10–5 | 4.0 × 10–4 | 6 × 10–4 | 1.5 × 10–3 |
MIR548A3 | 1.2 × 10–5 | 2.3 × 10–4 | 5.4 × 10–5 | 2.1 × 10–3 |
Abbreviations: CCDC138, coiled-coil domain-containing 138; CLDN15, claudin-15; GPR26, G protein-coupled receptor 26; GUCY1A3, guanylate cyclase 1 soluble subunit alpha-3; HIF3A, hypoxia-inducible factor 3 subunit alpha; LRT, likelihood ratio test; MIR548A3, microRNA 548a-3; RANBP2, RAN binding protein 2; SPINK1, serine protease inhibitor, Kazal type 1.
a The list of the top 10 genes using the permutation test (100,000,000 permutations), and the corresponding P values using the LRT, the Firth test, and the Wald test. This analysis uses data from a study of neonatal complications (15).
Gene . | Permutation Test . | LRT . | Firth . | Wald Test . |
---|---|---|---|---|
RANBP2 | 5.0 × 10–7 | 3.6 × 10–5 | 5.4 × 10–5 | 2.1 × 10–4 |
AK316092 | 1.0 × 10–6 | 1.4 × 10–5 | 6.6 × 10–5 | 9.7 × 10–3 |
GPR26 | 1.0 × 10–6 | 1.8 × 10–5 | 2.9 × 10–5 | 2.1 × 10–4 |
CCDC138 | 3.5 × 10–6 | 1.0 × 10–4 | 1.4 × 10–4 | 4.2 × 10–4 |
HIF3A | 3.5 × 10–6 | 1.2 × 10–3 | 1.7 × 10–3 | 5.5 × 10–3 |
GUCY1A3 | 5.5 × 10–6 | 1.2 × 10–4 | 2 × 10–4 | 6.5 × 10–4 |
CLDN15 | 5.5 × 10–6 | 4.5 × 10–4 | 8.4 × 10–4 | 4.1 × 10–3 |
SPINK1 | 7.0 × 10–6 | 1.3 × 10–4 | 2.2 × 10–4 | 9.6 × 10–4 |
DQ579288 | 1.1 × 10–5 | 4.0 × 10–4 | 6 × 10–4 | 1.5 × 10–3 |
MIR548A3 | 1.2 × 10–5 | 2.3 × 10–4 | 5.4 × 10–5 | 2.1 × 10–3 |
Gene . | Permutation Test . | LRT . | Firth . | Wald Test . |
---|---|---|---|---|
RANBP2 | 5.0 × 10–7 | 3.6 × 10–5 | 5.4 × 10–5 | 2.1 × 10–4 |
AK316092 | 1.0 × 10–6 | 1.4 × 10–5 | 6.6 × 10–5 | 9.7 × 10–3 |
GPR26 | 1.0 × 10–6 | 1.8 × 10–5 | 2.9 × 10–5 | 2.1 × 10–4 |
CCDC138 | 3.5 × 10–6 | 1.0 × 10–4 | 1.4 × 10–4 | 4.2 × 10–4 |
HIF3A | 3.5 × 10–6 | 1.2 × 10–3 | 1.7 × 10–3 | 5.5 × 10–3 |
GUCY1A3 | 5.5 × 10–6 | 1.2 × 10–4 | 2 × 10–4 | 6.5 × 10–4 |
CLDN15 | 5.5 × 10–6 | 4.5 × 10–4 | 8.4 × 10–4 | 4.1 × 10–3 |
SPINK1 | 7.0 × 10–6 | 1.3 × 10–4 | 2.2 × 10–4 | 9.6 × 10–4 |
DQ579288 | 1.1 × 10–5 | 4.0 × 10–4 | 6 × 10–4 | 1.5 × 10–3 |
MIR548A3 | 1.2 × 10–5 | 2.3 × 10–4 | 5.4 × 10–5 | 2.1 × 10–3 |
Abbreviations: CCDC138, coiled-coil domain-containing 138; CLDN15, claudin-15; GPR26, G protein-coupled receptor 26; GUCY1A3, guanylate cyclase 1 soluble subunit alpha-3; HIF3A, hypoxia-inducible factor 3 subunit alpha; LRT, likelihood ratio test; MIR548A3, microRNA 548a-3; RANBP2, RAN binding protein 2; SPINK1, serine protease inhibitor, Kazal type 1.
a The list of the top 10 genes using the permutation test (100,000,000 permutations), and the corresponding P values using the LRT, the Firth test, and the Wald test. This analysis uses data from a study of neonatal complications (15).
Four of our top 10 genes (Table 4) are promising candidate genes for neonatal complications (NC). First, the gene for hypoxia-inducible factor 3 subunit alpha (HIF3A) (ranked 5) is a transcriptional regulator in adaptive response to low oxygen tension (29), and it has been shown repeatedly to influence hypoxia (30, 31), which is characterized by a significant deficiency of oxygen levels in the blood. HIF3A has also been linked by several studies to the development of embryonic and neonatal cardiorespiratory systems (32, 33). Both conditions make HIF3A an attractive candidate to understand bronchopulmonary dysplasia in preterm infants. Second, because nitric oxide is a common treatment given to preterm infants affected with chronic lung disease, it is interesting that the gene for guanylate cyclase 1 soluble subunit alpha-3 (GUCY1A3 ) (ranked 6) interacts with a β subunit to form the nitric oxide–activated enzyme guanylate cyclase (34, 35). As such, it is plausible that mutations in GUCY1A3 could reduce guanylate cyclase levels and/or enzymatic function, which in turn could reduce the efficacy of nitric oxide as a prophylactic for chronic lung disease. Third, our top hit, for RAN binding protein 2 (RANBP2) (ranked 1), and the gene for coiled-coil domain-containing 138 (CCDC138) (ranked 4) form an overlapping gene cluster, where mutations in RANBP2 have been shown to cause familial acute necrotizing encephalopathy, a rare genetic disorder, common in preterm infants (36–39).
DISCUSSION
Limitations: sequencing platforms and approximate type 1 error
One limitation of this research is that the mean heterozygosity (23.7%) across the measured sites in the 1KG pseudo-null data set is 7 times larger than the mean heterozygosity (3.4%) in the NC pseudo-null data set. Thus, the 1KG pseudo-null data set is substantially more variable than the NC pseudo-null data set. To make the two data sets more comparable in terms of their observed HDE, we used fewer samples (n = 54) to construct the 1KG pseudo-null data set compared with the 74 infants with neonatal complications taken from the NC study (i.e., the NC pseudo-null data set). Note that, because improved sequencing methods are expected to detect more genetic variants (40, 41), they could mitigate the negative effect of the HDE in small case-control association studies.
Another limitation arises because the distribution of P values under the null hypothesis is never uniform for discrete data (e.g., binary outcomes with a discrete predictor). As such, the finite sampling distribution of P values under the null hypothesis is both unknown and dependent on sample size. Therefore, we used the distribution of P values from the permutation test as a proxy for the unknown finite sampling distribution of P values under the null hypothesis. Because the permutation test P values of the 1KG pseudo-null data set are not statistically different from a uniform distribution (P > 0.31), and because the permutation test P values of our NC pseudo-null data set are not statistically different from a uniform distribution (P > 0.45), the negative impact of using the permutation test as a proxy for the unknown finite sampling distribution of P values under the null hypothesis is likely small.
Pseudo-null data sets versus direct simulation under the null
In humans, very little is known about the mutation process that gives rise to rare genetic variation, and this is particularly true for rare variants that occur in highly conserved regions of the genome (e.g., exons). As such, it is difficult to imagine a direct simulation of genetic burden across exonic regions that would bear any resemblance to the actual burden observed in human populations. To overcome this obstacle, we chose to construct pseudo-null data sets (i.e., whole-exome sequence data sets where subjects do not vary, or at least are not known to vary in phenotype). These phenotypically homogeneous data sets must (by construction) follow the null hypothesis of no association, and moreover, they also retain realistic patterns of rare genetic variation.
Our recommendation for analyzing small case-control studies
When the Wald test is used in conjunction with logistic regression (or more generally, with any generalized linear model (10)), small P values may be inflated, especially if the sample size is small, the effects in question are large, and the covariate follows a skewed distribution (e.g., burden). Although this aberrant behavior leads to a reduction in power, it is easily circumvented by using an LRT or permutation test. As such, we strongly recommend that researchers use the permutation test, which outperforms the Wald test and LRT in the context of logistic regression and small samples. In situations where the permutation test is computationally intensive (i.e., the P values are very small), we recommend that users first use an LRT as a screening tool to identify P values that are “promising” (e.g., P < 0.20). Then, for each promising P value, users should determine the number of permutations needed to recalculate more accurate P values that do not suffer from the inappropriate use of large sample approximations. In fact, this is precisely what we did in our real-data analysis, where fewer than 1% of all genes even have a chance of being statistically significant (see Figure 5B); hence, it is computationally inefficient to apply the permutation test to all 23,854 genes. Recall that while LRT P values are immune to HDE (10), they can still suffer from violations of asymptotic normality (especially when sample sizes are small). Therefore, to reduce the computational inefficiency, and to correct (if necessary) the relevant LRT P values, we computed permutation P values only for the top 200 genes. This resulted in an approximate 100× speed-up compared with computing permutation P values for all 23,854 genes. Note that the top 200 genes were selected on the basis of their LRT P values.
Whenever the HDE is present, and whenever asymptotic assumptions are violated in case-control studies, researchers can increase power by using a (possibly computationally efficient) permutation-based approach, as we have done here. In addition to providing significantly shorter analysis times for case-control studies, by using our proposed approach in the reanalysis of existing small-sample studies, and/or in the analysis of newly collected small-sample studies, researchers should be able to:
Increase the rate at which disease-related genes and pathways are discovered.
Increase the chance that existing findings will be successfully replicated.
Identify plausible scientific hypotheses that can be tested in larger follow-up studies.
Significantly increase the likelihood of developing better biomarkers, perform more powerful meta-analyses, and eventually, find novel pathogenic mutations influencing common, complex human diseases.
Therefore, the work that we present here is highly educational, and could dramatically raise the level of awareness of this often-understudied problem. As a direct result of this work, we truly hope that the research community finds renewed purpose and value for small-sample case-control studies, especially when the exposure variable has many levels.
ACKNOWLEDGMENTS
Author affiliations: Department of Biophysics, The Ohio State University, Columbus, Ohio, United States (Komla M. Gnona); and GIG Statistical Consulting LLC, Columbus, Ohio, United States (William C. L. Stewart).
This work was supported by the Abigail Wexner Research Institute at Nationwide Children’s Hospital (grant 50313-0003-1220).
The 1000 Genomes Project data set used in this study can be found in the 1000 Genomes repository at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data. The neonatal complications data set is available upon request.
The views expressed in this article are those of the authors and do not reflect those of the Abigail Wexner Research Center and its members.
Conflict of interest: none declared.