Benchmarking enrichment analysis methods with the disease pathway network

Abstract Enrichment analysis (EA) is a common approach to gain functional insights from genome-scale experiments. As a consequence, a large number of EA methods have been developed, yet it is unclear from previous studies which method is the best for a given dataset. The main issues with previous benchmarks include the complexity of correctly assigning true pathways to a test dataset, and lack of generality of the evaluation metrics, for which the rank of a single target pathway is commonly used. We here provide a generalized EA benchmark and apply it to the most widely used EA methods, representing all four categories of current approaches. The benchmark employs a new set of 82 curated gene expression datasets from DNA microarray and RNA-Seq experiments for 26 diseases, of which only 13 are cancers. In order to address the shortcomings of the single target pathway approach and to enhance the sensitivity evaluation, we present the Disease Pathway Network, in which related Kyoto Encyclopedia of Genes and Genomes pathways are linked. We introduce a novel approach to evaluate pathway EA by combining sensitivity and specificity to provide a balanced evaluation of EA methods. This approach identifies Network Enrichment Analysis methods as the overall top performers compared with overlap-based methods. By using randomized gene expression datasets, we explore the null hypothesis bias of each method, revealing that most of them produce skewed P-values.


Enrichment Analysis methods
The methods under investigation belonged to four different categories: (i) OVerlap Analysis (OVA), Per-Gene score Analysis (PGA), Pathway Topology Analysis (PTA) and Network Enrichment Analysis (NEA) methods.We report here algorithmic details of each EA category and method in the benchmark.Based on the underlying null hypothesis, an EA method can also be classified as competitive or self-contained [2].A self-contained method tests if genes in the set of interest are differentially expressed whereas a competitive method tests if they are as differentially expressed as the genes not in the set.For a self-contained test on a gene set, the differential expression of only one of its genes may be enough to reject the null hypothesis that none of the genes in the gene set are different, overall resulting in increased sensitivity.However, self-contained methods might not be helpful in a battery testing setup given the large number of significant reported results.As a consequence, most of the available methods are competitive.

OVerlap Analysis (OVA)
OVerlap Analysis (OVA) methods test the proportion of differentially expressed genes (DEGs) in a functional gene collection against a discrete probability distribution model.One-sided p-values are extracted with a Fisher's exact test in Fisher as implemented in [3], and with a modified Fisher's exact test in EASE [4], as implemented in DAVID [5].

Per-Gene score Analysis (PGA)
The most popular approach for analyzing expression data works with each gene separately, employing a statistical model to connect the response to each gene's expression.Each gene undergoes a local statistical test that is used to determine a parametric or permutation-based p-value of variation of expression.A global statistical analysis is then implemented to assign an Enrichment Score (ES) to a functional set of genes.For permutation-based methods, the significance of the global statistics for each functional gene set is then evaluated using 2000 sample permutations.Gene Set Enrichment Analysis (GSEA) uses a Kolmogorov-Smirnov (KS) like running-sum statistics to test if the rankings of the genes in a pathway are significantly biased towards high DE fold-changes or p-values [6].To derive the ES, the running-sum statistic is increased (or decreased) upon encountering a gene inside (or outside) the pathway meanwhile traversing down from the top of the sorted gene list.NES is the maximum deviation from zero (ES) normalized by the size of the gene set.Here fast GSEA (fGSEA) is used to conduct a pre-ranked GSEA [7].A p-value is estimated by permuting the genes in a gene set, which leads to randomly assigned gene sets of the same size.Gene Set Analysis (GSA) substitutes the KS-like statistic of GSEA with the "maxmean" statistic (i.e. the maximum mean gene scores between the up-or down-regulated component of a gene set), and scales it by its mean and standard deviation to obtain the re-standardized version of the "maxmean" statistic [8].GSA generates a null distribution for estimating false discovery rates by combining gene and sample permutations.ROtAtion gene Set Test (ROAST) implements a self-contained test that evaluates the enrichment of a gene set using rotation, a Monte Carlo simulation technology, rather than permutation [9].Gene Set Variation Analysis (GSVA) computes sample-wise gene set enrichment scores as a function of genes within and outside the gene set, transforming the data from a gene by sample matrix to a gene set by sample matrix [10].P-values are extracted by fitting a generalized linear model (GLM) to each pathway profile of ESs and computing a moderated t-test, by empirical Bayes moderation of the ES standard error towards a global value.As recommended in the manual, we used limma to examine variations in enrichment scores between sample groups.Correlation Adjusted MEan RAnk gene set test (CAMERA) implements a test that accounts for gene-to-gene correlations [11].CAMERA was run in two configurations: (i) CAMERA_fixed using a fixed inter-gene correlation of 0.01; (ii) CAMERA_flex using an inter-gene correlation calculated from the input dataset.Pathway Analysis with Down-weighting of Overlapping Genes (PADOG) pre-computes a table with gene frequencies in the gene sets database and then extracts a ES for each gene set by assigning the mean of absolute values of moderated gene t-scores, weighted by the gene frequencies [12].In this way, genes that are found in fewer gene sets have a great impact in the analysis, whereas the contribution of ubiquitous genes is down-weighted.The ES is tested against the null distribution, generated by sample permutations.

Pathway Topology Analysis (PTA)
Signaling Pathway Impact Analysis (SPIA) tests both the over-representation of DEGs and the perturbation of each pathway [13].The overlap between DEGs and pathway genes is tested with the hypergeometric test.The perturbation of a pathway is determined by propagating fold-changes in expression across the pathway topology.The pathway topology can be represented with a signed directed graph, where the sign represents activation or inhibition.To generate the SPIA pathway input, we used the function makeSPIAdata that processes KEGG xml files.To represent activation and inhibition, we assigned a weight vector β where 1 denotes activation, -1 denotes inhibition and 0 other associations.To assess the significance of a pathway perturbation, SPIA compares the observed net perturbation against the perturbation computed in a randomized scenario in which as many genes as the DEGs are allowed to occupy any position along the pathway and to have any log fold-change within the range of DEGs values.A final p-value is returned as the combination of the two analyses, by default using the Fisher method.SPIA does not return a functional enrichment for roughly 1/3 of total pathways, as metabolic pathways in KEGG do not include any activation and/or repression interaction.Centrality Pathway Over Representation Analysis (CePa-ORA) uses pathways as networks with nodes representing genes [14].For each pathway, CePa-ORA measures node weights using various centrality metrics: in-degree, out-degree, betweenness, in-largest reach, out-largest reach, and equal weight condition.The final pathway score is calculated by summing the weights of nodes that were differentially expressed in the DE analysis.Pathway significance is assessed using a null distribution derived from permutation of DE genes.Similarly to Nguyen et al. 2019 [15], we selected the final p-value for each pathway as the lowest p-value among the six derived from the different centrality measurements, as the authors did not indicate a preferred metric.

Network Enrichment Analysis (NEA)
NETwork-based Pathway Enrichment Analysis (netPEA) executes Random Walk with Restarts (RWR) on a network by using DEGs as seed nodes with a predetermined restart probability, by default of 0.5 [16].At steady state, the resulting probabilities will contain the likelihood of RWR ending in each network node that may be used to derive a distance between the DEGs and all other nodes in the graph.NetPEA averages the probabilities between all genes in a gene set to obtain a final ES. z-scores are calculated to extract p-values by extracting a null distribution with a size-aware resampling procedure of size 2000.Network Enrichment Analysis Test (NEAT) finds crosstalks between the DEGs and a gene set.The crosstalk is defined as the degree of interconnectivity between two gene sets mapped onto a functional association network.NEAT tests enrichment (or depletion) significance against a hypergeometric discrete null distribution of crosstalks [17].Naïvely, it calculates the expected number of links between two gene sets based on their degrees and then uses the hypergeometric distribution to assess statistical significance.In this way the method assumes that under no functional enrichment the crosstalk follows a hypergeometric distribution, thus the analysis speeds up.BINOmial null distribution of X-talk (BinoX) evaluates whether the crosstalk between a set of DEGs and a gene set is significantly different from the expected crosstalk, which is estimated from a null model based on randomisation of the network [18].By default, BinoX extracts N randomized networks based on the Link Assignment and Second Order Conservation method (LA+S), which preserves the original second order topological properties and the network degree distribution within the random model.Once the parameters of the random model have been determined, they may be used to assess the likelihood of observed crosstalk.Because all network connections are binary, it is assumed that crosstalk between any two gene sets follows a binomial distribution.As for NEAT and BinoX, if the expected crosstalk is less than the observed, the p-value for the upper tail represents enrichment, whereas the opposite may be interpreted as depletion and is calculated based on the lower tail of the distribution.As depletion is out of the scope of the benchmark, we take 1 -p-value as significance for all gene sets with smaller crosstalk than expected.Adaptive NUll distriButIon of X-talk (ANUBIX) tests the crosstalk between the DEGs and a gene set [19].
Statistical significance is assessed by testing the crosstalk against a per DE gene set adapted beta-binomial distribution estimated by computing a discrete null distribution of crosstalk computed with a degree-aware resampling procedure of size 2000.
KEGG is chosen to compare methods throughout the analysis.2 -Size of disease pathway networks at different levels of confidence.We report the details of the disease/target pathways according to KEGG (i.e.pathway name, pathway id and pathway subclass), and the size of their network of interactions with all other pathways ("All", including the target pathway itself).As we assess the sensitivity of the methods by using the disease pathway network at different levels of confidence, we report the number of all linked pathways and for different top cutoffs (10,20,40).See "Disease pathway network construction" in the Methods section for details about the network construction process.

Table 3 -
Average disease pathway similarity within KEGG subclasses.Listed is the average similarity, in terms of Jaccard index, between all disease pathways within the same KEGG subclass, for disease pathways in the Disease Pathway Network using the top 20 linked pathways.