Pathway-based analysis is rapidly emerging as an alternative but powerful approach for searching for disease causal genes from genomic datasets and has been applied to many complex diseases recently, but it is only now beginning to be applied in psychiatry. Here, we discuss critical issues in the pathway-based approach by specifically comparing the first pathway analysis of genome-wide association studies (GWAS) datasets in neuropsychiatric disorders by O'Dushlaine and colleagues (Molecular Psychiatry 2010, doi:10.1038/mp.2010.7) with our analysis. We also computed the power of gene set enrichment analysis, hypergeometric test, and SNP ratio test in order to assist future applications of these methods in pathway-based analysis of GWAS datasets. Overall, we suggest that the pathway-based approach is effective but caution is needed in interpreting the results of such analysis.
Genome-wide association studies (GWAS) have become a popular approach for searching for common genetic variants which increase susceptibility to complex diseases or traits such as neuropsychiatric disorders. However, at the stringent genome-wide significance level of p<5×10−8, many markers that are truly but weakly associated with disease often fail to be detected (Purcell et al.2009; Shi et al.2009; Stefansson et al.2009; Sullivan et al.2008). Recently, pathway-based or gene set enrichment analysis of GWAS datasets has emerged as an alternative but potentially powerful approach to searching for disease causal genes, assuming a complex disease such as schizophrenia might have resulted from a number of genes which disrupt one or more pathways or protein complexes. Specifically in psychiatry, O'Dushlaine and colleagues (2010) recently applied a pathway-based approach to mine for genetic signals for schizophrenia and bipolar disorder based upon GWAS data, which represented the first successful application of this approach in the field. Using two GWAS datasets they identified five pathways which were significantly associated with schizophrenia. However, only the cell adhesion molecular (CAM) pathway remained significant after corrections for multiple testing. Furthermore, the CAM pathway was also significant in an independent GWAS dataset (WTCCC) for bipolar disorder.
The pathway analysis approach in O'Dushlaine et al. (2010) relied on the sigle nucleotide polymorphism (SNP) ratio test (SRT). The SRT computes a ‘SNP ratio’ for each pathway, i.e. it divides the number of significant markers (defined by the authors as those with p<0.05) by the number of non-significant markers within a pathway. To estimate the significance of each pathway, O'Dushlaine et al. (2010) randomized the case-control status of individuals in the original GWAS data to generate 1000 permutation datasets and then calculated an empirical p value based on the permutation datasets. We performed a similar pathway-based analysis of GWAS datasets (Jia et al.2010) using the dataset from the Genetic Association Information Network (GAIN) (Shi et al.2009), one of the two schizophrenia GWAS datasets used by O'Dushlaine et al. (2010). In contrast with the results in O'Dushlaine et al. our approach found several other pathways consistently associated with schizophrenia, including glutamate metabolism pathway, TNFR1 pathway, and TGFβ signalling pathway (see Supplementary online material). In the following sections, we discuss some methodology issues in GWAS pathway analysis (GWASPA) that may help to explain discrepancies in ours and O'Dushlaine et al.'s study. The different results in our analysis from that of O'Dushlaine et al. suggested many factors might affect the results of pathway-based methods and, thus, caution is needed in interpreting the results of such analyses.
GWASPA: theory and application
Applications of pathway analysis to high-throughput datasets (e.g. microarray gene expression and GWAS data) have been extensively discussed recently (Perry et al.2009; Wang et al.2007, 2008, 2009). It has been suggested that there are two different statistical hypotheses for testing disease associations with a group of functionally related genes in the same pathway (Tian et al.2005; Wang et al.2008): (1) the genes in the candidate pathway show the same pattern of associations with disease compared to the rest of the genes; and (2) the candidate pathway is not associated with a trait/disease of interest. These two hypotheses are related but address different aspects in pathway analysis. While our analysis of the GAIN study using the hypergeometric test and the Gene Set Enrichment Analysis (GSEA) method tested hypothesis 1 or ‘enrichment’ in the pathways, the SRT employed in O'Dushlaine et al. (2010) tested disease association with the pathways more directly via hypothesis 2.
For testing pathway enrichment or hypothesis 1, a 2×2 table needs to be constructed, as in Table 1. In Table 1, x is the number of significant markers and M – x is the number of non-significant markers in a pathway of interest, K is the total number of significant markers and N – K is the total number of non-significant markers in the list of N SNPs mapped to pathways. Given a total of N SNPs where M of them are mapped to genes in the pathway, the probability of observing at least x SNPs from K SNPs can be calculated based on the hypergeometric distribution. While the hypergeometric test compares the proportion of significant markers in the pathway vs. the proportion of significant markers in the remaining SNPs, the SRT in O'Dushlaine et al. (2010) is only based on x and M – x, the first row in the 2×2 contingency table (Table 1), and ignores the background ratio of significant vs. non-significant markers. For comparison, we prepared data from GAIN GWAS for schizophrenia in the same way as in O'Dushlaine et al. (2010) [all markers in a pathway (pathway-wise), 5 kb gene boundary, Armitage trend test, KEGG annotations provided in SRT tool (O'Dushlaine et al.2009), cut-off p value 0.05, and 1000 times of permutation] and then performed the hypergeometric test. As a result, the top five significant pathways were ‘regulation of autophagy’, ‘valine, leucine and isoleucine biosynthesis’, ‘histidine metabolism’, ‘glutathione metabolism’, and ‘aminoacyl-tRNA biosynthesis’, while CAM pathway was ranked 22nd (see Supplementary online material). These results are remarkably different from those in O'Dushlaine et al. (2010). Of note, ‘glycan structures biosynthesis 1’ pathway, one of the five significant pathways in O'Dushlaine et al. (2010) was obsolete in the current version of the KEGG database (as of 27 February 2010). Unfortunately, we were unable to test ISC data due to its access limitation.
SNP, Single nucleotide polymorphism.
It is worth noting that both SRT and hypergeometric tests use a cut-off threshold p value to identify significant markers/genes. This cut-off threshold could impact the results greatly (see power analysis below). Besides, a cut-off-based approach treats association signals qualitatively, rather than quantitatively; thus, it may lose important information in the analysis by dichotomization of markers as significant or non-significant.
Alternatively, the GSEA, which is based on a modified Kolmogorov–Smirnov test, compares distributions of p values for genes in a pathway with the rest of the genes, and makes use of the continuous information in the p values (Subramanian et al.2005). GSEA has been successfully applied to identify the IL-12/IL-23 pathway that is significantly related to Crohn's disease (Wang et al.2009). We applied GSEA to analyse the schizophrenia GAIN dataset and found several significant pathways that are related to apoptosis, inflammation, and the immune system (e.g. TNFR1 pathway, TGFβ pathway, TOB1 pathway). This result supports the recent finding implicating the immune system (e.g. MHC region on chromosome 6) in schizophrenia by the ISC GWAS dataset as well as its combined analysis with MGS and SGENE GWAS datasets (Purcell et al.2009; Shi et al.2009; Stefansson et al.2009).
To test hypothesis 2, we first generated permutation datasets by randomizing the labels of cases/controls in the original GWAS dataset and then computed p values of each random set using the same method as in the real dataset. When the sample labels are randomized, SNP genotypes in the permutation datasets are independent of the diseases status, while the genomic structures between markers [e.g. the linkage disequilibrium (LD) structure] are maintained. Therefore, pathway analysis based on permuting sample labels accounts for LD patterns in the markers. An empirical p value on whether a pathway is significantly associated with the disease is then estimated in the same way as in O'Dushlaine et al. (2010), Wang et al. (2009) and our results above (Jia et al.2010).
We next computed the power of GSEA, hypergeometric test, and SRT in identifying disease-associated gene sets. Random gene sets were generated using the GAIN GWAS dataset by considering three factors: cut-off threshold of association p value, proportion of disease-associated genes in a gene set, and size of gene set. Three cut-off p values (denoted as C) were used to define disease association: 0.05, 0.01, and 0.001. We set gene set size (denoted as S) to be 30, 50 or 100 and proportion of disease-associated genes (denoted as R) to be 50% or 30%. As a result, a total of 18 random gene-set scenarios were available for each test (GSEA, hypergeometric, SRT) (see Table 2). In each scenario, to generate a random gene set with parameters C, S and R, we randomly selected S×R genes from those genes whose association p values were smaller than C and randomly selected S×(1 – R) genes from those genes whose association p values were greater than or equal to C. One hundred gene sets (i.e. 100 repeats) were generated in each random gene-set scenario and then pathway enrichment tests were performed using 1000 permutations by swapping sample labels in the original GWAS dataset. For each method, we estimated power of the tests at 0.05 significance level (α=0.05) by computing the proportion of gene sets with p values <0.05. Table 2 summarizes the power analysis results. All three methods had improved power when gene-set size was increased, when the strength of association for markers in the gene set was increased (decreased cut-off p value for selecting disease-associated markers), or when the proportion of disease-associated genes/SNPs increased (50% vs. 30%). For gene sets consisting of markers moderately associated (cut-off p value for selecting disease-associated genes=0.01) or highly associated (cut-off p value=0.001) with disease, among all tests, the hypergeometric test performed best with the highest power. On the other hand, for gene sets consisting of markers weakly associated with disease (cut-off p value=0.05), all three methods lacked power. Of note, in our power analysis we used similar parameters to Dinu et al. (2007). In realistic conditions, association signals might not be as strong as in this power analysis, especially in some complex diseases or traits such as schizophrenia.
GSEA, Gene Set Enrichment Analysis; SRT, SNP ratio test; SNP, Single nucleotide polymorphism.
In each power analysis, 100 random gene sets were generated. For hypergeometric test and SRT, p<0.01 was used to define a gene or a SNP as being significant with the disease.
Disease-associated genes were selected by the smallest p values in genes (GSEA and hypergeometric test) or by SNP p values (SRT).
For GSEA, NES⩾0 is also required for a gene set to be significant.
Commentary and recommendations
It is important to note that some additional factors might have also contributed to the discrepancies in these two studies: gene boundaries (20 kb in our analysis vs. 5 kb in O'Dushlaine et al.2010); basic allelic test in our analysis vs. Armitage trend test in O'Dushlaine et al. (2010) ; KEGG+BioCarta pathway annotations in our analysis vs. an older version of KEGG dataset in O'Dushlaine et al. (2010) ; a cut-off p value of 0.01 in our analysis vs. 0.05 in O'Dushlaine et al. (2010) and 10 000 times permutations in our analysis vs. 1000 times in O'Dushlaine et al. (2010). The exact effects of these factors on the results might be complex in each disease case and might vary in different diseases. For example, definition of gene boundary needs to balance the risk of missing important regulatory regions by a narrow gene boundary and the inflation of non-informative markers (i.e. dilution of association signal) by a large gene boundary. As shown in Table 3, when the gene boundary increases, more SNPs located in the gene regions will meet the significance threshold values and those additional SNPs and their corresponding genes would be used in the pathway-based analysis. So far, there has been no standard definition of gene boundary. However, recent studies have suggested 20 kb might be reasonable, as the majority of eQTLs in gene expression studies lay within 20 kb of the genes (Veyrieras et al.2008). Both basic allelic test and Armitage trend test have been used in previous pathway-based analysis of GWAS datasets (Wang et al.2007, 2009) because these two methods do not rely on strong assumptions and are unbiased in most cases compared to other tests, e.g. tests assuming dominant model, recessive mode, or additive model. Furthermore, the cut-off values for hypergeometric test and SRT to define association SNPs need to consider their power (see Table 2). The cut-off p values may also depend on the underlying genetic structure and the disease under investigation. Specifically for psychiatric disorders, which may involve multiple SNPs with each playing a small to moderate role, a cut-off of 0.01 was used in a previous study of bipolar disorder (Holmans et al.2009), as well as in our analysis (Jia et al.2010).
Number of SNPs that are mapped to genes using each gene boundary. The number of genes in which the SNPs are mapped is included in parentheses.
The total number of SNPs (in both genic and non-genic regions) within each p value range in the GWAS dataset.
In summary, pathway-based approaches have both the advantages and limitations in analysing high-throughput datasets, especially when applied in GWAS data analysis. Several factors, such as pathway annotations, pathway size, gene length, SNP density in genes, definition of gene boundaries, and assignment of a p value to a gene which has multiple SNPs, could impact the results substantially and make the analysis results unstable. It is important to clearly state the hypothesis to be tested and interpret the findings appropriately. Finally, considering the lack of power in each method for gene sets consisted of weakly associated genes and the effect of different parameters, when possible, we suggest researchers should apply multiple methods to evaluate the reliability of the results, perform meta-analysis to increase sample size, and use multiple datasets when available to replicate the findings. We caution that methodology issues such as those outlined here may need to be addressed before definite conclusions can be firmly drawn from any pathway-based analysis of GWAS.
Supplementary material accompanies this paper on the Journal's website.
We thank three anonymous reviewers for valuable suggestions. This project was partially supported by a NIH grant (AA017437) and a 2009 NARSAD Maltz Investigator Award (to Z. Z.). Z. Z. received additional support from Vanderbilt-Ingram Cancer Center Support Grant (NIH P30CA68485) and Vanderbilt's Specialized Program of Research Excellence in GI Cancer (grant no. 50CA95103).
Statement of Interest