Florian Schmid, Matthias Schmid, Christoph Müssel, J. Eric Sträng, Christian Buske, Lars Bullinger, Johann M. Kraus, Hans A. Kestler, GiANT: gene set uncertainty in enrichment analysis, Bioinformatics, Volume 32, Issue 12, 15 June 2016, Pages 1891–1894, https://doi.org/10.1093/bioinformatics/btw030
Abstract
Summary: Over the past years growing knowledge about biological processes and pathways revealed complex interaction networks involving many genes. In order to understand these networks, analysis of differential expression has continuously moved from single genes towards the study of gene sets. Various approaches for the assessment of gene sets have been developed in the context of gene set analysis (GSA). These approaches are bridging the gap between raw measurements and semantically meaningful terms.
We present a novel approach for assessing uncertainty in the definition of gene sets. This is an essential step when new gene sets are constructed from domain knowledge or given gene sets are suspected to be affected by uncertainty. Quantification of uncertainty is implemented in the R-package GiANT. We also included widely used GSA methods, embedded in a generic framework that can readily be extended by custom methods. The package provides an easy to use front end and allows for fast parallelization.
Availability and implementation: The package GiANT is available on CRAN.
Contacts: hans.kestler@leibniz-fli.de or hans.kestler@uni-ulm.de
1 Introduction
Differential expression analysis investigates the association of measurements of single genes to a predefined phenotype (e.g. tumour versus inflammation). Frequently yielding thousands of differentially expressed genes, such analyses are often hard to interpret in a biological context (Zeeberg et al., 2003). Gene set analyses aim at assigning a meaning to differentially expressed genes by comparing them to sets of genes whose relevance in processes or pathways is known. Such gene sets can be derived, e.g. from Gene Ontology (Gene Ontology Consortium, 2000), KEGG (Kanehisa and Goto, 2000), AgeFactDb (Hühne et al., 2014), Reactome (Joshi-Tope et al., 2005), WikiPathways (Pico et al., 2008) or other collections of gene sets (Glez-Peña et al., 2009; Huang et al., 2007; Subramanian et al., 2005). Albeit frequently used, gene set analysis suffers from pitfalls concerning the significance assessment (Goeman and Bühlmann, 2007; Maciejewski, 2013). The large variety of methods also often makes it difficult to choose an analysis. Besides that, issues arise from the crafting of the gene sets. For example, rapidly changing knowledge may not only affect the members of sets, but also the mapping of probe sets to gene identifiers (Bleazard et al., 2015; Retraction for Dixson et al., 2014; Sedeño-Cortés and Pavlidis, 2014) and the validity of included genes under a certain condition (e.g. if a gene set is only enriched in the data because of genes that have been wrongly assigned to the gene set).
To address these issues of generating gene sets, we present a novel method to quantify the uncertainty in gene set analyses. This method assesses the impact of changes in the gene set definition on the result of a gene set analysis. We apply a bootstrap-type resampling strategy in which parts of the original gene set are replaced by randomly chosen genes. By analyzing the resulting credibility intervals, an estimate of the certainty in the definition of the gene set can be derived. Such robustness assessments are essential for the validation of custom hand-crafted gene sets and are also of interest for sets extracted from the knowledge bases mentioned above.
2 Method and application
2.1 Method
In a robust gene set, slight changes in the definition of the set should not have strong effects on its statistical significance. We therefore implemented a robustness evaluation for gene set analysis that rates the certainty in the definition of a hand-crafted gene set. This evaluation is based on repeated gene set analyses with slightly modified versions of the set in order to measure how strongly uncertainty in the gene set affects statistical significance.
To evaluate the fuzziness of GS, we use the following three-step approach. In the first step, the test statistic of interest (denoted by t) is computed for each of the resampled gene sets, resulting in estimates of the distribution of t at various values of k, the fraction of genes retained from the original gene set. Setting k = 0, i.e. including none of the original gene set genes, results in the ‘null distribution’ of t. In the second step, 90% credibility intervals for t are constructed for each k by evaluating the 5% and 95% quantiles of the estimated distributions. In the final step, the minimum value of k for which the credibility interval does not overlap with the null interval is calculated. We use this minimum value as an estimate of the certainty in the definition of GS: large values indicate a high sensitivity of t with regard to the random replacement of genes in GS, and conversely low values indicate a high robustness (see Fig. 1).
Fig. 1. Quantification of the uncertainty in the Rb pathway. The lines with circles show the tested degrees of uncertainty (k) for the Rb pathway. The dots in each column give the quantiles of the test statistic values obtained by resampling a percentage of genes from the Rb pathway (k) and the remaining genes (1 − k) from the set of all genes in the dataset. As a test statistic, the mean absolute correlation (Spearman) of the gene set genes to the class label has been used. The values for k = 0 give the quantiles of the null distribution, with the lower horizontal line corresponding to the 95% quantile. The upper horizontal line shows the value of the test statistic for the original Rb pathway. The estimated uncertainty of the gene set is the minimum value of k whose credibility interval does not overlap with the credibility interval of the null distribution. For the Rb pathway this value is 85%. An estimate of the upper bound of certainty can be given by calculating the slope of the dotted line by using the median of the null distribution and […]. When shifted to the 95% quantile, the point of intersection with […] is the upper bound. (Color version of this figure is available at Bioinformatics online.)
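The three-step procedure can be illustrated with a minimal base R sketch. It does not use the GiANT implementation: the data are synthetic, the helper names (`t_stat`, `resample_set`) are hypothetical, and drawing the replacement genes without duplicates as well as treating "non-overlapping" as "lower bound above the null upper bound" are assumptions made for this illustration.

```r
set.seed(1)

# Synthetic data (assumption): 200 genes x 12 samples, two classes of 6.
n_genes <- 200
labs <- rep(c(0, 1), each = 6)
dat <- matrix(rnorm(n_genes * length(labs)), nrow = n_genes,
              dimnames = list(paste0("g", seq_len(n_genes)), NULL))
gene_set <- paste0("g", 1:20)
# Shift the gene set genes in class 1 so that the set carries signal.
dat[gene_set, labs == 1] <- dat[gene_set, labs == 1] + 3

# Test statistic t: mean absolute Spearman correlation to the class label.
gene_cor <- abs(apply(dat, 1, cor, y = labs, method = "spearman"))
t_stat <- function(genes) mean(gene_cor[genes])

# Keep a fraction k of the original set, fill the rest with random genes.
resample_set <- function(k) {
  n_keep <- round(k * length(gene_set))
  kept <- sample(gene_set, n_keep)
  pool <- setdiff(rownames(dat), kept)
  c(kept, sample(pool, length(gene_set) - n_keep))
}

# Step 1: distribution of t for each k (k = 0 is the null distribution).
ks <- seq(0, 1, by = 0.1)
n_rep <- 200
t_samples <- sapply(ks, function(k) replicate(n_rep, t_stat(resample_set(k))))

# Step 2: 90% intervals from the 5% and 95% quantiles, one column per k.
ci <- apply(t_samples, 2, quantile, probs = c(0.05, 0.95))

# Step 3: smallest k whose interval lies entirely above the null interval.
null_upper <- ci[2, 1]
separated <- ci[1, ] > null_upper
k_min <- if (any(separated)) min(ks[separated]) else NA
print(k_min)
```

A coarse grid of k and 200 resamples keep the sketch fast; the analysis in Section 2.2 uses a finer grid and 1000 resamples per value of k.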
We implemented the R-package GiANT for analyzing the uncertainty in the definition of gene sets. Following Ackermann and Strimmer (2009), the package also includes a toolbox for gene set analysis that is modularized into four steps. The first step is the analysis of differential expression. The resulting gene-level statistic may be transformed for subsequent steps and is then summarized for a specified gene set, i.e. a gene set statistic is calculated. The significance of this statistic with respect to certain null hypotheses is finally assessed in a resampling-based testing procedure. Importantly, the GiANT package is based on a highly generic framework that allows users to create custom analyses by replacing some or all steps by their own implementations. It also supports parallelization of calculations, which allows for the analysis of gene set collections, benchmark studies or approaches that combine the results of several gene set analyses (Väremo et al., 2013). Furthermore, the package ensures that parallelized random number generation has no effect on the outcome (L’Ecuyer et al., 2002).
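The four-step modular structure can be sketched in plain R as follows. This is not the GiANT API: all function names (`gene_level`, `transform_abs`, `set_mean`, `signif_resampling`, `run_pipeline`) are hypothetical, the data are synthetic, and the significance step assumes a null hypothesis based on random gene sets of the same size.

```r
set.seed(2)

# Synthetic data (assumption): 200 genes x 12 samples, two classes of 6.
labs <- rep(c(0, 1), each = 6)
dat <- matrix(rnorm(200 * length(labs)), nrow = 200,
              dimnames = list(paste0("g", 1:200), NULL))
gene_set <- paste0("g", 1:20)
dat[gene_set, labs == 1] <- dat[gene_set, labs == 1] + 3

# Step 1: gene-level statistic (Spearman correlation to the class label).
gene_level <- function(dat, labs) {
  apply(dat, 1, cor, y = labs, method = "spearman")
}

# Step 2: transformation of the gene-level statistic.
transform_abs <- function(x) abs(x)

# Step 3: gene set statistic summarizing the genes of one set.
set_mean <- function(x, genes) mean(x[genes])

# Step 4: significance by comparison with random sets of the same size.
signif_resampling <- function(x, genes, set_stat, n_samples = 1000) {
  observed <- set_stat(x, genes)
  null <- replicate(n_samples,
                    set_stat(x, sample(names(x), length(genes))))
  # Add-one correction keeps the estimated p-value above zero.
  (sum(null >= observed) + 1) / (n_samples + 1)
}

# The pipeline takes each step as an argument, so any step can be swapped.
run_pipeline <- function(dat, labs, genes,
                         gls = gene_level,
                         transformation = transform_abs,
                         gss = set_mean,
                         significance = signif_resampling) {
  x <- transformation(gls(dat, labs))
  list(statistic = gss(x, genes),
       p_value = significance(x, genes, gss))
}

res <- run_pipeline(dat, labs, gene_set)

# Replacing one step: use the median instead of the mean as set statistic.
res_median <- run_pipeline(dat, labs, gene_set,
                           gss = function(x, genes) median(x[genes]))
print(res$p_value)
```

Passing the steps as function arguments mirrors the idea that a custom analysis replaces some or all steps while the remaining ones are reused unchanged.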
2.2 Application
We demonstrate the handling of the GiANT package using a dataset of six p53-deficient liver tumor samples and seven samples of DEN-induced liver tumors in mice. The dataset includes 30 278 genes and no missing values. Following Katz et al. (2012), we extracted the gene expression values and preprocessed the dataset using the standard workflow. For normalization, baseline and percentile (75%) shifts were performed after log2 transformation. Katz et al. (2012) showed a significant enrichment of a hand-crafted Rb pathway gene set in this dataset. The set consists of 123 upstream genes, interaction partners and downstream targets related to the Rb pathway. All genes were collected from literature using the PubMed database. Expression levels of the Rb pathway genes in the data are illustrated in Figure 2.
Fig. 2. Heatmap of the 123 genes associated to the Rb pathway. Hierarchical clustering of samples (complete linkage) in the feature subspace of the Rb pathway coincides with the known classes DEN and P53 (lower and upper labeling). Expression values have been mean centred and scaled within each gene.
In the following we evaluate the uncertainty in the definition of this hand-crafted Rb pathway gene set. As a test statistic, the average absolute (Spearman) correlation to the class label of all genes in the set is calculated. Statistical significance of the gene set is assessed with a computer-intensive resampling test:
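    library(GiANT)

    # Uncertainty analysis for the hand-crafted Rb pathway gene set
    evaluateGeneSetUncertainty(
        dat = dataset$data,                 # gene expression values
        labs = dataset$labs,                # class labels of the samples
        geneSet = rbPathway,                # the 123 Rb pathway genes
        analysis = gsaTools.averageCorrelation(),
        method = "spearman",                # type of correlation
        numSamplesUncertainty = 1000,
        numSamples = 1000,
        k = seq(0.01, 0.99, by = 0.01))     # tested degrees of uncertainty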
A detailed step-by-step example is given in the vignette of the GiANT package. Figure 1 visualizes the distributions with different degrees of uncertainty: for each value of k, the quantiles of the resulting distribution are given. k = 0 gives the null distribution, with the lower horizontal (green) line showing the corresponding 95% quantile. The upper horizontal (red) line gives the value of the test statistic for the Rb pathway. The black vertical line indicates the degree of uncertainty where the two distributions (null distribution and distribution with a fixed degree of uncertainty) do not overlap. As we can see, even for 15% of random genes, the perturbed gene sets achieve higher scores than the 95% quantile of the null distribution and are thus statistically significant with respect to the null distribution. This indicates that the Rb pathway gene set is robustly enriched (k = 0.85) in the dataset.
3 Conclusion
Sets are the core of gene set enrichment analyses. Every analysis is based on their correct definition. Motivated by critical articles like Sedeño-Cortés and Pavlidis (2014), Retraction for Dixson et al. (2014) and Bleazard et al. (2015), we developed a new robustness analysis that assesses and quantifies the performed enrichment analyses using a partial resampling approach. These analyses are especially important for newly found pathways, but well-established and curated databases also suffer from errors and uncertainty in their gene set definitions. Our method now allows a quantification of the gene set uncertainty and therefore an assessment of the validity of the enrichment analyses. The approach is implemented in the R package GiANT, which also provides a comprehensive toolkit for generic gene set analysis. Apart from standard methods like GSEA, user-defined workflows can be constructed readily within the flexible pipeline mechanism. This allows the user to build new high-level analyses, adapted to the specific context of use.
Funding
The research leading to these results received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 602783, the DFG (SFB 1074 project Z1), and the German Federal Ministry of Education and Research (BMBF, Gerontosys II, Forschungskern SyStaR, project ID 0315894A) all to H.A.K.
Conflict of Interest: none declared.
Author notes
† The authors wish it to be known that, in their opinion, the last two authors should be regarded as Joint Last Authors.
Associate Editor: Jonathan Wren


