Abstract

As meta-analysis results published by consortia of genome-wide association studies (GWASs) become increasingly available, many association summary statistics-based multi-locus tests have been developed to jointly evaluate multiple single-nucleotide polymorphisms (SNPs) to reveal novel genetic architectures of various complex traits. The validity of these approaches relies on the accurate estimate of z-score correlations at considered SNPs, which in turn requires knowledge on the set of SNPs assessed by each study participating in the meta-analysis. However, this exact SNP coverage information is usually unavailable from the meta-analysis results published by GWAS consortia. In the absence of the coverage information, researchers typically estimate the z-score correlations by making oversimplified coverage assumptions. We show through real studies that such a practice can generate highly inflated type I errors, and we demonstrate the proper way to incorporate correct coverage information into multi-locus analyses. We advocate that consortia should make SNP coverage information available when posting their meta-analysis results, and that investigators who develop analytic tools for joint analyses based on summary data should pay attention to the variation in SNP coverage and adjust for it appropriately.

Introduction

Genome-wide association study (GWAS) has become an effective way to identify common genetic variants underlying various complex traits. To maximize the chance of discovering novel outcome-associated variants by increasing the sample size, many consortia have been formed to conduct meta-analyses on data across multiple GWAS [1]. The standard way for a GWAS meta-analysis is through the single-locus analysis, which evaluates one single-nucleotide polymorphism (SNP) at a time. As the easily accessible SNP-level association summary statistics (e.g. log odds ratio and its SD) generated from large-scale GWAS meta-analysis become increasingly available, various multi-locus methods have been proposed for conducting summary data-based analyses. One such test is the conditional test [2], which assesses the SNP association by conditioning on SNPs at previously identified loci. Another example is gene- or pathway-based association test [3–10], which jointly evaluates association between the outcome and SNPs at one or multiple genes. Other procedures include summary statistics imputation [11–14], transcriptome-wide association study [15, 16], Mendelian randomization [17] and genetic heritability estimation [18–20]. These popular analytic procedures have yielded new insights into the genetic architectures of complex traits and diseases.

A critical step in many summary data-based multi-locus analyses is to estimate correlations of z-scores, which are usually approximated by genotype correlations in a population-specific reference panel such as 1000 Genomes Project [21], or an existing GWAS. These estimations are accurate only if all participating studies investigate the same set of SNPs. This is not often the case in practice because of different genotyping platforms in use. Although genotype imputation has been routinely applied to individual studies before they are combined for the meta-analysis, heterogeneity in SNP coverage usually still exists in many published meta-analysis results. For example, there is considerably SNP coverage heterogeneity in meta-analysis results published by the successful DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) consortium. We will show in this report that the accurate estimation of z-score correlations requires knowledge of exact SNP coverage information, i.e. the set of SNPs evaluated in each participating study. We demonstrate through numerical experiments and real applications that assuming inaccurate SNP coverage pattern would result in highly inflated type I errors in several popular multi-locus analyses, including the conditional test, and gene/pathway association tests. We also show the proper way to incorporate the correct coverage information into these analyses.

Method

In general, genotypes are imputed when conducting GWAS meta-analyses. For the rest of the discussion, we assume that GWAS data have been imputed, and do not differentiate genotyped and imputed SNPs. We start by briefly reviewing the methodology underlying the summary data-based multi-locus analysis. We assume that T SNPs are tested in a GWAS meta-analysis of L studies, with the lth study, l=1, , L, consisting of nl subjects. We use the binary index matrix C=cltL×T to define the SNP coverage in each study, with clt=1 if the tth SNP, is evaluated by the lth study, clt=0 otherwise. Denote nt=l=1Lcltnl to be the total sample size used in the meta-analysis for SNP t, and nts=l=1Lcltclsnl to be the number of subjects with genotypes available on both SNPs t and s. Let β^t and τt represent effect size estimate (such as the log odds ratio) of the tth SNP and its standard error, derived from the fixed effects model. Using the framework in [4, 5], we can recover the z-score at SNP t as zt=τt-1β^t, and approximate the variance–covariance matrix of z-scores as V=vtsT×T, where  
vts=κtsρts,
with  
κts=ntsntns,
and ρts being the Pearson’s correlation coefficient for genotypes on SNPs t and s. The correlation coefficient ρts can be estimated from a panel of reference genomes matching the study population [22]. The accuracy of estimating the factor κts depends on the knowledge of SNP coverage pattern across participating studies. It has been shown that many of the multi-locus analyses, such as the conditional test [2] and gene- and pathway-based association tests [3–7], can be carried out properly given zt and vts at considered SNPs, regardless of linkage disequilibrium (LD) level and study sample size variation.

From the definition of z-score correlation vts, conducting a multi-locus analysis requires the knowledge of the SNP coverage index matrix C, as well as the sample size n(l) of each participating study. The GWAS Consortium sometimes does not provide the information of C for the meta-analysis, but instead releases nt for each SNP. In some cases, it is possible to uniquely resolve C from equations nt=l=1Lcltnl given the sample size n(l) of participating studies. For example, we were able to recover the exact coverage information by applying this strategy to the summary data of the type 2 diabetes (T2D) GWAS released by the DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) consortium [23]. This strategy, however, could become infeasible when the number of participating studies is large, or study-specific sample sizes are unknown.

Common SNP coverage assumptions

When both C and n(l) are unknown, one option is to set κts=1, which essentially assumes that the same set of SNPs is evaluated by all participating studies. This strategy has been widely used in many multi-locus analysis procedures based on GWAS summary data without an explicit justification [3, 4, 6, 7]. We call this the uniform coverage assumption hereafter.

When nt is known, but the coverage information C is unknown, a commonly used strategy is to assume nts=minnt, ns, which is true when the set of SNPs in participating studies is nested within each other [2, 20]. We call this the nested coverage assumption.

Both uniform and nested coverage assumptions could lead to overestimation of z-score correlations, especially if the SNPs are in moderate or high LD. For example, when two SNPs are exclusively studied in two separate studies with equal sample sizes, the correlation between their meta-analysis z-scores is strictly 0 (i.e. κts=0). This correlation, however, is estimated to be ρts if a uniform or nested coverage pattern is inappropriately assumed. In the following, we will describe how the bias in estimating the z-score correlation would affect some commonly used multi-locus tests.

Conditional test based on GWAS summary data

The conditional test is one of the most commonly used methods for secondary analysis in GWAS. For example, it can be used in fine mapping to identify additional signals within an established region [24]. As the conditional test is equivalent to the single-marker test if the target SNP and the index SNP being conditioned on are independent, the conditional test only considers SNPs that are correlated with the index SNP. The test statistic of the conditional test on SNP t can be expressed as:  
Tcond=zt-vtszs1-vts2.

Note that 1-vts2 is the conditional variance of zt, given that the marginal association evidence zs has been observed at the index SNP s.

Overestimating the correlation vts because of the inappropriate assumptions can enormously change the value of Tcond in both directions. The direction depends on the relative magnitude of changes caused by the bias in the nominator and denominator of Tcond. It could lead to either type I error inflation or power reduction on the testing SNPs. To illustrate this, we consider two SNPs studied with equal sample size n. There are cn subjects having genotypes on both SNPs, where the proportion parameter c[0,1]. In this case, making the uniform coverage assumption is equivalent to making the nested one. Under both assumptions, vts is estimated to be ρts, no matter what the value c takes. In contrast, its consistent estimate based on the accurate coverage information is cρts. We fix zs=5.45 at the index SNP, i.e. the index SNP reaches the common genome-wide significant P-value threshold (5.0×10-8), and ρts=0.8. The z-score zt of target SNP varies from 0 to 5. The ratio of conditional P-values (in  -log10 scale) between the one using the wrong estimate vts=ρts and the one using the consistent estimate vts=cρts is showed in Figure 1. In the plot, ratios that are >10 are truncated to be 10 for the sake of clear presentation. From Figure 1, we have two main observations. First, the conditional test assuming the uniform or the nested coverage pattern tends to generate a P-value that is smaller than its true value when the marginal association evidence at the target SNP is moderate or weak (e.g. zt<2.5). The significance of the P-value becomes more exaggerated as c decreases (i.e. fewer subjects having their genotyped measured on both SNPs). Second, the conditional test could lose its power under the wrong assumptions when the z-score at the target SNP is relatively large (e.g. zt>4). We will show through real data analysis later that the type I error of the conditional test assuming incorrect coverage pattern is more likely to be inflated in practice, as most testing SNPs have relatively small marginal z-scores.

Figure 1

Contour plot of the ratio of conditional test P-values (in -log10 scale) between the one based on the uniform assumption and the one using the exact SNP coverage information. x-axis: z-score at the target SNP; y-axis: the proportion parameter c as defined in the main text.

Figure 1

Contour plot of the ratio of conditional test P-values (in -log10 scale) between the one based on the uniform assumption and the one using the exact SNP coverage information. x-axis: z-score at the target SNP; y-axis: the proportion parameter c as defined in the main text.

Gene-based association test based on GWAS summary data

The gene-based association tests are powerful alternative to the single-marker test for identifying associated loci in GWAS. We use the minP test [3] as an example to show how inappropriate coverage assumptions can affect the P-values of a gene-based test. For a gene with m SNPs, the minP test statistic is defined as:  
TminP=max1tmzt.
Given the observed value T0, the P-value of minP test can be calculated as:  
pminP=1-PTminPT0=1-∫∫∫-T0ztT0,t dFz;V,
where F(z;V) is the distribution function of a normal distributed vector z, which follows N(0,V). We consider a gene with 10 SNPs, with a pairwise correlation ρts=0.8. We assume the SNPs can be divided into two groups, each with five SNPs. Each SNP has its genotype measured on n subjects. We further assume that the SNP coverage within each group is uniform. For any two SNPs from different groups, we assume there are cn subjects, where the proportion parameter c[0,1], having their genotypes measured on both SNPs. Again, making the uniform or the nested coverage assumption over the gene has the same effect on the gene-level P-value because the estimate of z-score correlation remains the same (i.e. ρts) under both assumptions. In contrast, a consistent estimate of vts could be cρts for any pair of SNPs from different groups, and ρts for any pair of SNPs within a group. In Figure 2, we show the ratio of gene-level P-values (in  -log10 scale) between the test using vts=ρts and the test using vts=cρts, given various values of T0. We can see from Figure 2 that the test based on the wrong assumptions always generates a P-value that is smaller than its true value in this considered setting. Given that pathway-based association tests accumulate association evidence from individual genes, they are expected to be more vulnerable to the bias in estimating the z-score correlation. Results from real applications reported below will confirm this.
Figure 2

Contour plot of the ratio of the gene-level P-values (in -log10 scale) between the one based on the uniform assumption and the one using the exact SNP coverage information. x-axis: observed value of the minP test statistic; y-axis: the proportion parameter c as defined in the main text.

Figure 2

Contour plot of the ratio of the gene-level P-values (in -log10 scale) between the one based on the uniform assumption and the one using the exact SNP coverage information. x-axis: observed value of the minP test statistic; y-axis: the proportion parameter c as defined in the main text.

Results

In the following, we show through two real data sets how the performance of conditional test, and the gene- and pathway-based association tests, can be affected by different SNP coverage patterns. On both data sets, imputation was performed in each of the participating studies before conducting the meta-analyses. One data set is the meta-analysis result on 2.5 million SNPs generated from the DIAGRAM consortium, which consists of 12 171 T2D cases and 56 862 controls across 12 GWAS with European ancestry [23]. The other data set is the meta-analysis result on 2.6 million SNPs generated by the Asian Genetic Epidemiology Network (AGEN) consortium, which combines 8 GWAS of T2D with 6952 and 11 865 controls from Eastern Asian populations [25]. On both data sets, we exclude SNPs within ±500 kb of previously GWAS-identified T2D loci, so that potential association variants caused by LD with established loci are eliminated from our analysis. The DIAGRAM and AGEN meta-analysis results have genomic control inflation factors λ=1.10 and λ=1.08, respectively. We further adjusted for these genome-wide inflations in our analysis by dividing the z-score by λ at each SNP [5]. The purpose of excluding established loci and adjusting for λ is to mimic the situation where there is no enriched association signal, so that inflation observed in testing P-values can be attributed to inappropriate assumptions on SNP coverage. The sample size of each SNP in DIAGRAM and AGEN data sets is showed in Figure 3. The SNP indexes are arranged per their sample sizes nt. Compared with the AGEN data, the DIAGRAM data exhibit less but still noticeable heterogeneity in the SNP coverage across participating studies. Specifically, 19.0% of the SNPs are genotyped or imputed for all subjects in the DIAGRAM study; another 57.9% of the SNPs are available on 9580 cases and 53 810 controls. The nested coverage assumption is valid for SNPs in these two groups. To estimate genotype correlations, we used a panel of 503 European subjects (CEU, TSI, FIN, GBR and IBS), and a panel of 312 Eastern Asian subjects (CHB, CHS and JPT) from 1000 Genomes for analyzing DIAGRAM and AGEN data sets, respectively.

Figure 3

Sample sizes for each of SNPs in the (A) DIAGRAM and (B) AGEN studies. The SNP indexes are arranged per their sample sizes.

Figure 3

Sample sizes for each of SNPs in the (A) DIAGRAM and (B) AGEN studies. The SNP indexes are arranged per their sample sizes.

For the conditional test, we randomly picked regions of 100 kb in length throughout the genome. In each region, we focused on SNP pairs with squared Pearson’s correlation coefficients in the range of 0.5–0.9, and with minor allele frequencies (MAFs) > 1%. The conditional test was performed for every SNP pair with correct coverage information (https://CRAN.R-project.org/package=SCAT), as well as under the nested and uniform coverage assumptions. In total, 5 million unique SNP pairs were tested. For the gene-based analysis, we illustrated our results by applying the minP method [3] on >17 000 genes, which were downloaded from Homo sapiens genes NCBI36 and reference genome GRCh37.p13 using the Ensemble BioMart tool. For the pathway-based analysis, we applied the summary based Adaptive Rank Truncated Product (sARTP) method [5] on 4771 human and murine (mammalian) pathways (gene sets) collected from the MSigDB v5.0 (C2: curated gene sets). Like most pathway approaches, the sARTP method integrates association evidence from individual genes. We chose to use the minP statistic in sARTP to summarize gene-level association.

Figure 4 illustrates the Q-Q plots of conditional test P-values on two data sets. The conditional P-values using exact, nested or uniform coverages have inflation factors λ=0.99, 1.13 and 1.37 on DIAGRAM, and λ=1.00, 1.08 and 1.15 on AGEN. Substantial inflation emerges under the two inappropriate SNP coverage assumptions, even after z-scores have been rescaled by the square root of genomic control inflation factor, and after established T2D loci and their nearby SNPs have been excluded. Note that in Figure 4, points on the tails of Q-Q plots that show large deviation from the null represent only a small fraction of all 5 million points. The inflation vanishes, i.e. λ is close to 1.0 when the exact coverage information is used. Table 1 shows the proportions of conditional test P-values below certain thresholds. Although the nested coverage assumption gives less inflated P-values than those under the uniform coverage assumption, both assumptions lead to much higher false-positive rates than the one using correct coverage information.

Table 1.

Empirical type I errors of conditional tests applied to the DIAGRAM and AGEN data assuming exact, nested and uniform coverages

Nominal levelDIAGRAM
AGEN
ExactNestedUniformExactNestedUniform
0.1 0.10 0.13 0.17 0.10 0.12 0.13 
0.01 0.015 0.023 0.046 0.011 0.017 0.026 
0.001 0.0026 0.0058 0.017 0.0013 0.0034 0.0084 
0.0001 0.00062 0.0019 0.0080 0.00023 0.0011 0.0041 
Nominal levelDIAGRAM
AGEN
ExactNestedUniformExactNestedUniform
0.1 0.10 0.13 0.17 0.10 0.12 0.13 
0.01 0.015 0.023 0.046 0.011 0.017 0.026 
0.001 0.0026 0.0058 0.017 0.0013 0.0034 0.0084 
0.0001 0.00062 0.0019 0.0080 0.00023 0.0011 0.0041 
Table 1.

Empirical type I errors of conditional tests applied to the DIAGRAM and AGEN data assuming exact, nested and uniform coverages

Nominal levelDIAGRAM
AGEN
ExactNestedUniformExactNestedUniform
0.1 0.10 0.13 0.17 0.10 0.12 0.13 
0.01 0.015 0.023 0.046 0.011 0.017 0.026 
0.001 0.0026 0.0058 0.017 0.0013 0.0034 0.0084 
0.0001 0.00062 0.0019 0.0080 0.00023 0.0011 0.0041 
Nominal levelDIAGRAM
AGEN
ExactNestedUniformExactNestedUniform
0.1 0.10 0.13 0.17 0.10 0.12 0.13 
0.01 0.015 0.023 0.046 0.011 0.017 0.026 
0.001 0.0026 0.0058 0.017 0.0013 0.0034 0.0084 
0.0001 0.00062 0.0019 0.0080 0.00023 0.0011 0.0041 

Figure 4

The Q-Q plots for the conditional test P-values on 5 million pairs of SNPs in the (A) DIAGRAM and (B) AGEN studies. For each pair of SNPs, the squared Pearson’s correlation coefficient is in the range of 0.5–0.9, and has MAF > 1%.

Figure 4

The Q-Q plots for the conditional test P-values on 5 million pairs of SNPs in the (A) DIAGRAM and (B) AGEN studies. For each pair of SNPs, the squared Pearson’s correlation coefficient is in the range of 0.5–0.9, and has MAF > 1%.

The Q-Q plots in Figure 5 and the empirical type I errors in Table 2 suggest that the minP gene-based test under the nested coverage assumption seems to work reasonably well, compared with the one using exact coverage information, especially in the DIAGRAM data set. However, further inspection reveals that 75.4 and 97.1% of the genes, using the DIAGRAM and AGEN data, respectively, have smaller P-values if the nested coverage rather than the exact coverage is assumed. The gene-level inflation factors assuming the exact, nested and uniform coverages for DIAGRAM data are λ=0.96, 0.97 and 1.16, and for AGEN data are λ=1.15, 1.26 and 2.03. These observations suggest that procedures making inappropriate coverage assumptions would overestimate the significance of most genes, especially for the AGEN study, in which the SNP coverage pattern is more varied than that in the DIAGRAM study.

Table 2.

Empirical type I errors of gene-based tests applied to the DIAGRAM and AGEN data assuming exact, nested and uniform coverages

Nominal levelDIAGRAM
AGEN
ExactNestedUniformExactNestedUniform
0.1 0.11 0.11 0.12 0.11 0.12 0.16 
0.01 0.015 0.015 0.016 0.014 0.015 0.021 
0.001 0.0025 0.0025 0.0030 0.0017 0.0017 0.0024 
Nominal levelDIAGRAM
AGEN
ExactNestedUniformExactNestedUniform
0.1 0.11 0.11 0.12 0.11 0.12 0.16 
0.01 0.015 0.015 0.016 0.014 0.015 0.021 
0.001 0.0025 0.0025 0.0030 0.0017 0.0017 0.0024 
Table 2.

Empirical type I errors of gene-based tests applied to the DIAGRAM and AGEN data assuming exact, nested and uniform coverages

Nominal levelDIAGRAM
AGEN
ExactNestedUniformExactNestedUniform
0.1 0.11 0.11 0.12 0.11 0.12 0.16 
0.01 0.015 0.015 0.016 0.014 0.015 0.021 
0.001 0.0025 0.0025 0.0030 0.0017 0.0017 0.0024 
Nominal levelDIAGRAM
AGEN
ExactNestedUniformExactNestedUniform
0.1 0.11 0.11 0.12 0.11 0.12 0.16 
0.01 0.015 0.015 0.016 0.014 0.015 0.021 
0.001 0.0025 0.0025 0.0030 0.0017 0.0017 0.0024 

Figure 5

The Q-Q plots for the minP test applied to the (A) DIAGRAM and (B) AGEN studies. The genes were downloaded from Homo sapiens genes NCBI36 and reference genome GRCh37.p13.

Figure 5

The Q-Q plots for the minP test applied to the (A) DIAGRAM and (B) AGEN studies. The genes were downloaded from Homo sapiens genes NCBI36 and reference genome GRCh37.p13.

We next show how the bias in a gene-based analysis could be further amplified in a pathway-based analysis, which accumulates association signals from multiple genes, and hence aggregates the gene-level bias. Figure 6 shows the Q-Q plots for pathway-based analyses of 4771 candidate pathways. Like the gene-based analysis, the difference among pathway analyses under three SNP coverage assumptions is more obvious in the AGEN study than those in the DIAGRAM study. The AGEN study under the nested and uniform coverage assumptions has inflations λ=2.53 and 7.29, respectively, much higher than that using correct coverage information (λ=1.39). This implies that significant pathways identified under inappropriate coverage assumptions could be false positives. The inflation for the pathway analyses using correct coverage information is higher than expected (λ=2.09 for DIAGRAM and λ=1.39 for AGEN) when compared with the inflation observed in typical GWAS single-locus analysis because of two main reasons. First, the pathway P-values are correlated, as there are substantial overlaps in genes among tested pathways; hence, the inflation factor for a pathway analysis could have larger variation than that of a single-locus analysis. Second, even though the GWAS established loci have been excluded, and the genomic control inflation factor accounted for, the pathway analysis can aggregate weak association evidence from the remaining SNPs to detect association-enriched pathways. More discussion on this phenomenon can be found in [5].

Figure 6

The Q-Q plots for the sARTP test applied to the (A) DIAGRAM and (B) AGEN studies. The pathways (gene sets) were collected from the MSigDB v5.0 (C2: curated gene sets).

Figure 6

The Q-Q plots for the sARTP test applied to the (A) DIAGRAM and (B) AGEN studies. The pathways (gene sets) were collected from the MSigDB v5.0 (C2: curated gene sets).

Discussion

In summary, the validity of the multi-locus analyses based on GWAS summary data depends on how accurate the correlations of z-scores are estimated, which in turn requires knowledge on the SNP coverage in individual studies participating in the meta-analysis. When the SNP coverage information is unknown, our results demonstrate that three commonly used multi-locus analyses under oversimplified SNP coverage assumptions could suffer from inflated type I errors. This issue persists in many other multi-locus analyses, as long as estimating z-score correlation at considered SNPs is required. These include a wide range of commonly used analytic procedures, including gene- and pathway-based tests [6–10, 15, 16, 26–28] and summary statistics imputation [11–14]. Inappropriate SNP coverage assumptions have no effect on the heritability estimates [18–20] but can lead to underestimated variances. The assumption on SNP coverage may only have a negligible impact on some other analyses that often target on SNPs in low LD, such as rare variant analyses [29–31] and Mendelian randomization [17].

To the best of our knowledge, except for the sARTP procedure for gene- and pathway-based analysis, other existing summary data-based multi-locus approaches either make the uniform or nested coverage assumption in their implementations, partially because the SNP coverage information is usually unavailable in public GWAS meta-analysis results. To maximize the utility of GWAS data, and more importantly, to ensure the validity of summary data-based joint analyses, we advocate that GWAS meta-analysis consortia should also post SNP coverage information, along with the study-specific sample sizes in the future. A convenient way to release such information is to follow the format adopted by programs METAL [32] and ARTP2 [5], which requires a column of SNP effect direction strings, e.g. ‘ ++?-’, where ‘+’ or ‘-’ gives the sign of the SNP effect relative to the reference allele, and ‘?’ indicates that the SNP is missing from a study. We also suggest that investigators who develop analytic tools based on GWAS summary data should pay attention to the variation in SNP coverage pattern and adjust for it appropriately.

Key Points

  • Multi-locus analyses are effective joint testing procedures, as GWAS meta-analysis summary statistics become easily available.

  • Assuming inappropriate SNP coverage in many widely used multi-locus association tests, such as the conditional test and gene- and pathway-based tests, for analyzing summary data can result in highly inflated type I errors.

  • Publishing GWAS summary data with detailed SNP coverage information is recommended for GWAS consortia.

  • Investigators who develop or apply analytic tools for joint analyses based on summary data should be aware of the potential variation in SNP coverage pattern and adjust it accordingly.

Acknowledgements

This study used the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov/). The authors thank the DIAGRAM consortium and the AGEN-T2D consortium for sharing the meta-analysis summary data. The authors acknowledge the editor and the anonymous reviewers for their constructive comments to improve this article.

Funding

The Intramural Research Program of the National Institute of Health, National Cancer Institute, Division of Cancer Epidemiology and Genetics. This research was supported by Intramural Research Program of the National Cancer Institute, USA.

Han Zhang is a research fellow at the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, USA.

William Wheeler is a senior statistician at the Information Management Services Inc., USA.

Lei Song is a bioinformatician at the Cancer Genomics Research Laboratory, Frederick National Laboratory for Cancer Research, Leidos Biomedical Research Inc., USA.

Kai Yu is a senior investigator at the Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, USA.

References

1

Thompson
JR
,
Attia
J
,
Minelli
C.
The meta-analysis of genome-wide association studies
.
Brief Bioinform
2011
;
12
:
259
69
.

2

Yang
J
,
Ferreira
T
,
Morris
AP
, et al. 
Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits
.
Nat Genet
2012
;
44
:
369
75
.

3

Liu
JZ
,
McRae
AF
,
Nyholt
DR
, et al. 
A versatile gene-based test for genome-wide association studies
.
Am J Hum Genet
2010
;
87
:
139
45
.

4

Hu
YJ
,
Berndt
SI
,
Gustafsson
S
, et al. 
Meta-analysis of gene-level associations for rare variants based on single-variant statistics
.
Am J Hum Genet
2013
;
93
:
236
48
.

5

Zhang
H
,
Wheeler
W
,
Hyland
PL
, et al. 
A powerful procedure for pathway-based meta-analysis using summary statistics identifies 43 pathways associated with type II diabetes in European populations
.
Plos Genet
2016
;
12
:
e1006122
.

6

Lamparter
D
,
Marbach
D
,
Rueedi
R
, et al. 
Fast and rigorous computation of gene and pathway scores from SNP-based summary statistics
.
PLoS Comput Biol
2016
;
12
:
e1004714
.

7

Kwak
IY
,
Pan
W.
Adaptive gene- and pathway-trait association testing with GWAS summary statistics
.
Bioinformatics
2016
;
32
:
1178
84
.

8

Burren
OS
,
Guo
H
,
Wallace
C.
VSEAMS: a pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes
.
Bioinformatics
2014
;
30
:
3342
8
.

9

Evangelou
M
,
Smyth
DJ
,
Fortune
MD
, et al. 
A method for gene-based pathway analysis using genomewide association study summary statistics reveals nine new type 1 diabetes associations
.
Genet Epidemiol
2014
;
38
:
661
70
.

10

Mishra
A
,
Macgregor
S.
VEGAS2: software for more flexible gene-based testing
.
Twin Res Hum Genet
2015
;
18
:
86
91
.

11

Lee
D
,
Bigdeli
TB
,
Riley
BP
, et al. 
DIST: direct imputation of summary statistics for unmeasured SNPs
.
Bioinformatics
2013
;
29
:
2925
7
.

12

Park
DS
,
Brown
B
,
Eng
C
, et al. 
Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses
.
Bioinformatics
2015
;
31
:
i181
9
.

13

Pasaniuc
B
,
Zaitlen
N
,
Shi
H
, et al. 
Fast and accurate imputation of summary statistics enhances evidence of functional enrichment
.
Bioinformatics
2014
;
30
:
2906
14
.

14

Xu
Z
,
Duan
Q
,
Yan
S
, et al. 
DISSCO: direct imputation of summary statistics allowing covariates
.
Bioinformatics
2015
;
31
:
2434
42
.

15

Barbeira
A
,
Dickinson
SP
,
Torres
JM
, et al. 
Integrating tissue specific mechanisms into GWAS summary results
.
bioRxiv
2016
. doi: https://doi.org/10.1101/045260.

16

Gusev
A
,
Ko
A
,
Shi
H
, et al. 
Integrative approaches for large-scale transcriptome-wide association studies
.
Nat Genet
2016
;
48
:
245
52
.

17

Burgess
S
,
Dudbridge
F
,
Thompson
SG.
Combining information on multiple instrumental variables in Mendelian randomization: comparison of allele score and summarized data methods
.
Stat Med
2016
;
35
:
1880
906
.

18

Bulik-Sullivan
BK
,
Loh
PR
,
Finucane
HK
, et al. 
LD Score regression distinguishes confounding from polygenicity in genome-wide association studies
.
Nat Genet
2015
;
47
:
291
5
.

19

Finucane
HK
,
Bulik-Sullivan
B
,
Gusev
A
, et al. 
Partitioning heritability by functional annotation using genome-wide association summary statistics
.
Nat Genet
2015
;
47
:
1228
35
.

20

Yang
J
,
Benyamin
B
,
McEvoy
BP
, et al. 
Common SNPs explain a large proportion of the heritability for human height
.
Nat Genet
2010
;
42
:
565
9
.

21

1000 Genomes Project Consortium
,
Abecasis
GR
,
Auton
A
, et al. 
An integrated map of genetic variation from 1,092 human genomes
.
Nature
2012
;
491
:
56
65
.

22

Pasaniuc
B
,
Price
AL.
Dissecting the genetics of complex traits using summary association statistics
.
Nat Rev Genet
2017
;
18
:
117
27
.

23

Morris
AP
,
Voight
BF
,
Teslovich
TM
, et al. 
Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes
.
Nat Genet
2012
;
44
:
981
90
.

24

Spain
SL
,
Barrett
JC.
Strategies for fine-mapping complex traits
.
Hum Mol Genet
2015
;
24
:
R111
9
.

25

Cho
YS
,
Chen
CH
,
Hu
C
, et al. 
Meta-analysis of genome-wide association studies identifies eight new loci for type 2 diabetes in East Asians
.
Nat Genet
2011
;
44
:
67
72
.

26

Lee
D
,
Williamson
VS
,
Bigdeli
TB
, et al. 
JEPEG: a summary statistics based tool for gene-level joint testing of functional variants
.
Bioinformatics
2015
;
31
:
1176
82
.

27

Network Pathway Analysis Subgroup of Psychiatric Genomics Consortium
.
Psychiatric genome-wide association study analyses implicate neuronal, immune and histone pathways
.
Nat Neurosci
2015
;
18
:
199
209
.

28

Segre
AV
;
Diagram Consortium, Magic investigators
, et al. 
Common inherited variation in mitochondrial genes is not enriched for associations with type 2 diabetes or related glycemic traits
.
PLoS Genet
2010
;
6
:
e1001058
.

29

Lee
S
,
Abecasis
GR
,
Boehnke
M
, et al. 
Rare-variant association analysis: study designs and statistical tests
.
Am J Hum Genet
2014
;
95
:
5
23
.

30

Liu
DJ
,
Peloso
GM
,
Zhan
X
, et al. 
Meta-analysis of gene-level tests for rare variant association
.
Nat Genet
2014
;
46
:
200
4
.

31

Tang
ZZ
,
Lin
DY.
Meta-analysis for discovering rare-variant associations: statistical methods and software programs
.
Am J Hum Genet
2015
;
97
:
35
53
.

32

Willer
CJ
,
Li
Y
,
Abecasis
GR.
METAL: fast and efficient meta-analysis of genomewide association scans
.
Bioinformatics
2010
;
26
:
2190
1
.

This work is written by US Government employees and is in the public domain in the US.