-
PDF
- Split View
-
Views
-
Cite
Cite
Han Zhang, William Wheeler, Lei Song, Kai Yu, Proper joint analysis of summary association statistics requires the adjustment of heterogeneity in SNP coverage pattern, Briefings in Bioinformatics, Volume 19, Issue 6, November 2018, Pages 1337–1343, https://doi.org/10.1093/bib/bbx072
Close -
Share
Abstract
As meta-analysis results published by consortia of genome-wide association studies (GWASs) become increasingly available, many association summary statistics-based multi-locus tests have been developed to jointly evaluate multiple single-nucleotide polymorphisms (SNPs) to reveal novel genetic architectures of various complex traits. The validity of these approaches relies on the accurate estimate of z-score correlations at considered SNPs, which in turn requires knowledge on the set of SNPs assessed by each study participating in the meta-analysis. However, this exact SNP coverage information is usually unavailable from the meta-analysis results published by GWAS consortia. In the absence of the coverage information, researchers typically estimate the z-score correlations by making oversimplified coverage assumptions. We show through real studies that such a practice can generate highly inflated type I errors, and we demonstrate the proper way to incorporate correct coverage information into multi-locus analyses. We advocate that consortia should make SNP coverage information available when posting their meta-analysis results, and that investigators who develop analytic tools for joint analyses based on summary data should pay attention to the variation in SNP coverage and adjust for it appropriately.
Introduction
Genome-wide association study (GWAS) has become an effective way to identify common genetic variants underlying various complex traits. To maximize the chance of discovering novel outcome-associated variants by increasing the sample size, many consortia have been formed to conduct meta-analyses on data across multiple GWAS [1]. The standard way for a GWAS meta-analysis is through the single-locus analysis, which evaluates one single-nucleotide polymorphism (SNP) at a time. As the easily accessible SNP-level association summary statistics (e.g. log odds ratio and its SD) generated from large-scale GWAS meta-analysis become increasingly available, various multi-locus methods have been proposed for conducting summary data-based analyses. One such test is the conditional test [2], which assesses the SNP association by conditioning on SNPs at previously identified loci. Another example is gene- or pathway-based association test [3–10], which jointly evaluates association between the outcome and SNPs at one or multiple genes. Other procedures include summary statistics imputation [11–14], transcriptome-wide association study [15, 16], Mendelian randomization [17] and genetic heritability estimation [18–20]. These popular analytic procedures have yielded new insights into the genetic architectures of complex traits and diseases.
A critical step in many summary data-based multi-locus analyses is to estimate correlations of z-scores, which are usually approximated by genotype correlations in a population-specific reference panel such as 1000 Genomes Project [21], or an existing GWAS. These estimations are accurate only if all participating studies investigate the same set of SNPs. This is not often the case in practice because of different genotyping platforms in use. Although genotype imputation has been routinely applied to individual studies before they are combined for the meta-analysis, heterogeneity in SNP coverage usually still exists in many published meta-analysis results. For example, there is considerably SNP coverage heterogeneity in meta-analysis results published by the successful DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) consortium. We will show in this report that the accurate estimation of z-score correlations requires knowledge of exact SNP coverage information, i.e. the set of SNPs evaluated in each participating study. We demonstrate through numerical experiments and real applications that assuming inaccurate SNP coverage pattern would result in highly inflated type I errors in several popular multi-locus analyses, including the conditional test, and gene/pathway association tests. We also show the proper way to incorporate the correct coverage information into these analyses.
Method
From the definition of z-score correlation , conducting a multi-locus analysis requires the knowledge of the SNP coverage index matrix , as well as the sample size of each participating study. The GWAS Consortium sometimes does not provide the information of for the meta-analysis, but instead releases for each SNP. In some cases, it is possible to uniquely resolve from equations given the sample size of participating studies. For example, we were able to recover the exact coverage information by applying this strategy to the summary data of the type 2 diabetes (T2D) GWAS released by the DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) consortium [23]. This strategy, however, could become infeasible when the number of participating studies is large, or study-specific sample sizes are unknown.
Common SNP coverage assumptions
When both and are unknown, one option is to set , which essentially assumes that the same set of SNPs is evaluated by all participating studies. This strategy has been widely used in many multi-locus analysis procedures based on GWAS summary data without an explicit justification [3, 4, 6, 7]. We call this the uniform coverage assumption hereafter.
When is known, but the coverage information is unknown, a commonly used strategy is to assume , which is true when the set of SNPs in participating studies is nested within each other [2, 20]. We call this the nested coverage assumption.
Both uniform and nested coverage assumptions could lead to overestimation of z-score correlations, especially if the SNPs are in moderate or high LD. For example, when two SNPs are exclusively studied in two separate studies with equal sample sizes, the correlation between their meta-analysis z-scores is strictly 0 (i.e. ). This correlation, however, is estimated to be if a uniform or nested coverage pattern is inappropriately assumed. In the following, we will describe how the bias in estimating the z-score correlation would affect some commonly used multi-locus tests.
Conditional test based on GWAS summary data
Note that is the conditional variance of , given that the marginal association evidence has been observed at the index SNP .
Overestimating the correlation because of the inappropriate assumptions can enormously change the value of in both directions. The direction depends on the relative magnitude of changes caused by the bias in the nominator and denominator of . It could lead to either type I error inflation or power reduction on the testing SNPs. To illustrate this, we consider two SNPs studied with equal sample size . There are subjects having genotypes on both SNPs, where the proportion parameter . In this case, making the uniform coverage assumption is equivalent to making the nested one. Under both assumptions, is estimated to be , no matter what the value takes. In contrast, its consistent estimate based on the accurate coverage information is . We fix at the index SNP, i.e. the index SNP reaches the common genome-wide significant P-value threshold (), and . The z-score of target SNP varies from 0 to 5. The ratio of conditional P-values (in scale) between the one using the wrong estimate and the one using the consistent estimate is showed in Figure 1. In the plot, ratios that are >10 are truncated to be 10 for the sake of clear presentation. From Figure 1, we have two main observations. First, the conditional test assuming the uniform or the nested coverage pattern tends to generate a P-value that is smaller than its true value when the marginal association evidence at the target SNP is moderate or weak (e.g. ). The significance of the P-value becomes more exaggerated as c decreases (i.e. fewer subjects having their genotyped measured on both SNPs). Second, the conditional test could lose its power under the wrong assumptions when the z-score at the target SNP is relatively large (e.g. ). We will show through real data analysis later that the type I error of the conditional test assuming incorrect coverage pattern is more likely to be inflated in practice, as most testing SNPs have relatively small marginal z-scores.
Contour plot of the ratio of conditional test P-values (in scale) between the one based on the uniform assumption and the one using the exact SNP coverage information. x-axis: z-score at the target SNP; y-axis: the proportion parameter as defined in the main text.
Contour plot of the ratio of conditional test P-values (in scale) between the one based on the uniform assumption and the one using the exact SNP coverage information. x-axis: z-score at the target SNP; y-axis: the proportion parameter as defined in the main text.
Gene-based association test based on GWAS summary data
Contour plot of the ratio of the gene-level P-values (in scale) between the one based on the uniform assumption and the one using the exact SNP coverage information. x-axis: observed value of the minP test statistic; y-axis: the proportion parameter as defined in the main text.
Contour plot of the ratio of the gene-level P-values (in scale) between the one based on the uniform assumption and the one using the exact SNP coverage information. x-axis: observed value of the minP test statistic; y-axis: the proportion parameter as defined in the main text.
Results
In the following, we show through two real data sets how the performance of conditional test, and the gene- and pathway-based association tests, can be affected by different SNP coverage patterns. On both data sets, imputation was performed in each of the participating studies before conducting the meta-analyses. One data set is the meta-analysis result on 2.5 million SNPs generated from the DIAGRAM consortium, which consists of 12 171 T2D cases and 56 862 controls across 12 GWAS with European ancestry [23]. The other data set is the meta-analysis result on 2.6 million SNPs generated by the Asian Genetic Epidemiology Network (AGEN) consortium, which combines 8 GWAS of T2D with 6952 and 11 865 controls from Eastern Asian populations [25]. On both data sets, we exclude SNPs within kb of previously GWAS-identified T2D loci, so that potential association variants caused by LD with established loci are eliminated from our analysis. The DIAGRAM and AGEN meta-analysis results have genomic control inflation factors and , respectively. We further adjusted for these genome-wide inflations in our analysis by dividing the z-score by at each SNP [5]. The purpose of excluding established loci and adjusting for is to mimic the situation where there is no enriched association signal, so that inflation observed in testing P-values can be attributed to inappropriate assumptions on SNP coverage. The sample size of each SNP in DIAGRAM and AGEN data sets is showed in Figure 3. The SNP indexes are arranged per their sample sizes . Compared with the AGEN data, the DIAGRAM data exhibit less but still noticeable heterogeneity in the SNP coverage across participating studies. Specifically, 19.0% of the SNPs are genotyped or imputed for all subjects in the DIAGRAM study; another 57.9% of the SNPs are available on 9580 cases and 53 810 controls. The nested coverage assumption is valid for SNPs in these two groups. To estimate genotype correlations, we used a panel of 503 European subjects (CEU, TSI, FIN, GBR and IBS), and a panel of 312 Eastern Asian subjects (CHB, CHS and JPT) from 1000 Genomes for analyzing DIAGRAM and AGEN data sets, respectively.
Sample sizes for each of SNPs in the (A) DIAGRAM and (B) AGEN studies. The SNP indexes are arranged per their sample sizes.
Sample sizes for each of SNPs in the (A) DIAGRAM and (B) AGEN studies. The SNP indexes are arranged per their sample sizes.
For the conditional test, we randomly picked regions of 100 kb in length throughout the genome. In each region, we focused on SNP pairs with squared Pearson’s correlation coefficients in the range of 0.5–0.9, and with minor allele frequencies (MAFs) > 1%. The conditional test was performed for every SNP pair with correct coverage information (https://CRAN.R-project.org/package=SCAT), as well as under the nested and uniform coverage assumptions. In total, 5 million unique SNP pairs were tested. For the gene-based analysis, we illustrated our results by applying the minP method [3] on >17 000 genes, which were downloaded from Homo sapiens genes NCBI36 and reference genome GRCh37.p13 using the Ensemble BioMart tool. For the pathway-based analysis, we applied the summary based Adaptive Rank Truncated Product (sARTP) method [5] on 4771 human and murine (mammalian) pathways (gene sets) collected from the MSigDB v5.0 (C2: curated gene sets). Like most pathway approaches, the sARTP method integrates association evidence from individual genes. We chose to use the minP statistic in sARTP to summarize gene-level association.
Figure 4 illustrates the Q-Q plots of conditional test P-values on two data sets. The conditional P-values using exact, nested or uniform coverages have inflation factors 0.99, 1.13 and 1.37 on DIAGRAM, and 1.00, 1.08 and 1.15 on AGEN. Substantial inflation emerges under the two inappropriate SNP coverage assumptions, even after z-scores have been rescaled by the square root of genomic control inflation factor, and after established T2D loci and their nearby SNPs have been excluded. Note that in Figure 4, points on the tails of Q-Q plots that show large deviation from the null represent only a small fraction of all 5 million points. The inflation vanishes, i.e. is close to 1.0 when the exact coverage information is used. Table 1 shows the proportions of conditional test P-values below certain thresholds. Although the nested coverage assumption gives less inflated P-values than those under the uniform coverage assumption, both assumptions lead to much higher false-positive rates than the one using correct coverage information.
Empirical type I errors of conditional tests applied to the DIAGRAM and AGEN data assuming exact, nested and uniform coverages
| Nominal level . | DIAGRAM . | AGEN . | ||||
|---|---|---|---|---|---|---|
| Exact . | Nested . | Uniform . | Exact . | Nested . | Uniform . | |
| 0.1 | 0.10 | 0.13 | 0.17 | 0.10 | 0.12 | 0.13 |
| 0.01 | 0.015 | 0.023 | 0.046 | 0.011 | 0.017 | 0.026 |
| 0.001 | 0.0026 | 0.0058 | 0.017 | 0.0013 | 0.0034 | 0.0084 |
| 0.0001 | 0.00062 | 0.0019 | 0.0080 | 0.00023 | 0.0011 | 0.0041 |
| Nominal level . | DIAGRAM . | AGEN . | ||||
|---|---|---|---|---|---|---|
| Exact . | Nested . | Uniform . | Exact . | Nested . | Uniform . | |
| 0.1 | 0.10 | 0.13 | 0.17 | 0.10 | 0.12 | 0.13 |
| 0.01 | 0.015 | 0.023 | 0.046 | 0.011 | 0.017 | 0.026 |
| 0.001 | 0.0026 | 0.0058 | 0.017 | 0.0013 | 0.0034 | 0.0084 |
| 0.0001 | 0.00062 | 0.0019 | 0.0080 | 0.00023 | 0.0011 | 0.0041 |
Empirical type I errors of conditional tests applied to the DIAGRAM and AGEN data assuming exact, nested and uniform coverages
| Nominal level . | DIAGRAM . | AGEN . | ||||
|---|---|---|---|---|---|---|
| Exact . | Nested . | Uniform . | Exact . | Nested . | Uniform . | |
| 0.1 | 0.10 | 0.13 | 0.17 | 0.10 | 0.12 | 0.13 |
| 0.01 | 0.015 | 0.023 | 0.046 | 0.011 | 0.017 | 0.026 |
| 0.001 | 0.0026 | 0.0058 | 0.017 | 0.0013 | 0.0034 | 0.0084 |
| 0.0001 | 0.00062 | 0.0019 | 0.0080 | 0.00023 | 0.0011 | 0.0041 |
| Nominal level . | DIAGRAM . | AGEN . | ||||
|---|---|---|---|---|---|---|
| Exact . | Nested . | Uniform . | Exact . | Nested . | Uniform . | |
| 0.1 | 0.10 | 0.13 | 0.17 | 0.10 | 0.12 | 0.13 |
| 0.01 | 0.015 | 0.023 | 0.046 | 0.011 | 0.017 | 0.026 |
| 0.001 | 0.0026 | 0.0058 | 0.017 | 0.0013 | 0.0034 | 0.0084 |
| 0.0001 | 0.00062 | 0.0019 | 0.0080 | 0.00023 | 0.0011 | 0.0041 |
The Q-Q plots for the conditional test P-values on 5 million pairs of SNPs in the (A) DIAGRAM and (B) AGEN studies. For each pair of SNPs, the squared Pearson’s correlation coefficient is in the range of 0.5–0.9, and has MAF > 1%.
The Q-Q plots for the conditional test P-values on 5 million pairs of SNPs in the (A) DIAGRAM and (B) AGEN studies. For each pair of SNPs, the squared Pearson’s correlation coefficient is in the range of 0.5–0.9, and has MAF > 1%.
The Q-Q plots in Figure 5 and the empirical type I errors in Table 2 suggest that the minP gene-based test under the nested coverage assumption seems to work reasonably well, compared with the one using exact coverage information, especially in the DIAGRAM data set. However, further inspection reveals that 75.4 and 97.1% of the genes, using the DIAGRAM and AGEN data, respectively, have smaller P-values if the nested coverage rather than the exact coverage is assumed. The gene-level inflation factors assuming the exact, nested and uniform coverages for DIAGRAM data are 0.96, 0.97 and 1.16, and for AGEN data are 1.15, 1.26 and 2.03. These observations suggest that procedures making inappropriate coverage assumptions would overestimate the significance of most genes, especially for the AGEN study, in which the SNP coverage pattern is more varied than that in the DIAGRAM study.
Empirical type I errors of gene-based tests applied to the DIAGRAM and AGEN data assuming exact, nested and uniform coverages
| Nominal level . | DIAGRAM . | AGEN . | ||||
|---|---|---|---|---|---|---|
| Exact . | Nested . | Uniform . | Exact . | Nested . | Uniform . | |
| 0.1 | 0.11 | 0.11 | 0.12 | 0.11 | 0.12 | 0.16 |
| 0.01 | 0.015 | 0.015 | 0.016 | 0.014 | 0.015 | 0.021 |
| 0.001 | 0.0025 | 0.0025 | 0.0030 | 0.0017 | 0.0017 | 0.0024 |
| Nominal level . | DIAGRAM . | AGEN . | ||||
|---|---|---|---|---|---|---|
| Exact . | Nested . | Uniform . | Exact . | Nested . | Uniform . | |
| 0.1 | 0.11 | 0.11 | 0.12 | 0.11 | 0.12 | 0.16 |
| 0.01 | 0.015 | 0.015 | 0.016 | 0.014 | 0.015 | 0.021 |
| 0.001 | 0.0025 | 0.0025 | 0.0030 | 0.0017 | 0.0017 | 0.0024 |
Empirical type I errors of gene-based tests applied to the DIAGRAM and AGEN data assuming exact, nested and uniform coverages
| Nominal level . | DIAGRAM . | AGEN . | ||||
|---|---|---|---|---|---|---|
| Exact . | Nested . | Uniform . | Exact . | Nested . | Uniform . | |
| 0.1 | 0.11 | 0.11 | 0.12 | 0.11 | 0.12 | 0.16 |
| 0.01 | 0.015 | 0.015 | 0.016 | 0.014 | 0.015 | 0.021 |
| 0.001 | 0.0025 | 0.0025 | 0.0030 | 0.0017 | 0.0017 | 0.0024 |
| Nominal level . | DIAGRAM . | AGEN . | ||||
|---|---|---|---|---|---|---|
| Exact . | Nested . | Uniform . | Exact . | Nested . | Uniform . | |
| 0.1 | 0.11 | 0.11 | 0.12 | 0.11 | 0.12 | 0.16 |
| 0.01 | 0.015 | 0.015 | 0.016 | 0.014 | 0.015 | 0.021 |
| 0.001 | 0.0025 | 0.0025 | 0.0030 | 0.0017 | 0.0017 | 0.0024 |
The Q-Q plots for the minP test applied to the (A) DIAGRAM and (B) AGEN studies. The genes were downloaded from Homo sapiens genes NCBI36 and reference genome GRCh37.p13.
The Q-Q plots for the minP test applied to the (A) DIAGRAM and (B) AGEN studies. The genes were downloaded from Homo sapiens genes NCBI36 and reference genome GRCh37.p13.
We next show how the bias in a gene-based analysis could be further amplified in a pathway-based analysis, which accumulates association signals from multiple genes, and hence aggregates the gene-level bias. Figure 6 shows the Q-Q plots for pathway-based analyses of 4771 candidate pathways. Like the gene-based analysis, the difference among pathway analyses under three SNP coverage assumptions is more obvious in the AGEN study than those in the DIAGRAM study. The AGEN study under the nested and uniform coverage assumptions has inflations and 7.29, respectively, much higher than that using correct coverage information (1.39). This implies that significant pathways identified under inappropriate coverage assumptions could be false positives. The inflation for the pathway analyses using correct coverage information is higher than expected ( for DIAGRAM and for AGEN) when compared with the inflation observed in typical GWAS single-locus analysis because of two main reasons. First, the pathway P-values are correlated, as there are substantial overlaps in genes among tested pathways; hence, the inflation factor for a pathway analysis could have larger variation than that of a single-locus analysis. Second, even though the GWAS established loci have been excluded, and the genomic control inflation factor accounted for, the pathway analysis can aggregate weak association evidence from the remaining SNPs to detect association-enriched pathways. More discussion on this phenomenon can be found in [5].
The Q-Q plots for the sARTP test applied to the (A) DIAGRAM and (B) AGEN studies. The pathways (gene sets) were collected from the MSigDB v5.0 (C2: curated gene sets).
The Q-Q plots for the sARTP test applied to the (A) DIAGRAM and (B) AGEN studies. The pathways (gene sets) were collected from the MSigDB v5.0 (C2: curated gene sets).
Discussion
In summary, the validity of the multi-locus analyses based on GWAS summary data depends on how accurate the correlations of z-scores are estimated, which in turn requires knowledge on the SNP coverage in individual studies participating in the meta-analysis. When the SNP coverage information is unknown, our results demonstrate that three commonly used multi-locus analyses under oversimplified SNP coverage assumptions could suffer from inflated type I errors. This issue persists in many other multi-locus analyses, as long as estimating z-score correlation at considered SNPs is required. These include a wide range of commonly used analytic procedures, including gene- and pathway-based tests [6–10, 15, 16, 26–28] and summary statistics imputation [11–14]. Inappropriate SNP coverage assumptions have no effect on the heritability estimates [18–20] but can lead to underestimated variances. The assumption on SNP coverage may only have a negligible impact on some other analyses that often target on SNPs in low LD, such as rare variant analyses [29–31] and Mendelian randomization [17].
To the best of our knowledge, except for the sARTP procedure for gene- and pathway-based analysis, other existing summary data-based multi-locus approaches either make the uniform or nested coverage assumption in their implementations, partially because the SNP coverage information is usually unavailable in public GWAS meta-analysis results. To maximize the utility of GWAS data, and more importantly, to ensure the validity of summary data-based joint analyses, we advocate that GWAS meta-analysis consortia should also post SNP coverage information, along with the study-specific sample sizes in the future. A convenient way to release such information is to follow the format adopted by programs METAL [32] and ARTP2 [5], which requires a column of SNP effect direction strings, e.g. ‘ ++?-’, where ‘+’ or ‘-’ gives the sign of the SNP effect relative to the reference allele, and ‘?’ indicates that the SNP is missing from a study. We also suggest that investigators who develop analytic tools based on GWAS summary data should pay attention to the variation in SNP coverage pattern and adjust for it appropriately.
Multi-locus analyses are effective joint testing procedures, as GWAS meta-analysis summary statistics become easily available.
Assuming inappropriate SNP coverage in many widely used multi-locus association tests, such as the conditional test and gene- and pathway-based tests, for analyzing summary data can result in highly inflated type I errors.
Publishing GWAS summary data with detailed SNP coverage information is recommended for GWAS consortia.
Investigators who develop or apply analytic tools for joint analyses based on summary data should be aware of the potential variation in SNP coverage pattern and adjust it accordingly.
Acknowledgements
This study used the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov/). The authors thank the DIAGRAM consortium and the AGEN-T2D consortium for sharing the meta-analysis summary data. The authors acknowledge the editor and the anonymous reviewers for their constructive comments to improve this article.
Funding
The Intramural Research Program of the National Institute of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics. This research was supported by Intramural Research Program of the National Cancer Institute, USA.
Han Zhang is a research fellow at the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, USA.
William Wheeler is a senior statistician at the Information Management Services Inc., USA.
Lei Song is a bioinformatician at the Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., USA.
Kai Yu is a senior investigator at the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, USA.






