Toward a gold standard for benchmarking gene set enrichment analysis

Abstract

Motivation: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets.

Results: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested, and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted, and which methods are best suited to effectively prioritize gene sets with high phenotype relevance.

Availability: http://bioconductor.org/packages/GSEABenchmarkeR

Contact: ludwig.geistlinger@sph.cuny.edu


S1.1 Enrichment methods
In the following, we provide a brief overview of the main features of each enrichment method listed in Table 1 of the main manuscript with a focus on functionality, statistical approach, and implementation. Please refer to the references provided in Table 1 of the main manuscript for methodological details, and to the R documentation of each method for implementation details. The categorization of each method as either self-contained or competitive is discussed in Supplementary Discussion S2.1. All methods under benchmark were carried out through the EnrichmentBrowser package [1], which uses its own implementation of ORA, available R scripts for GSEA and SAMGS, the corresponding CRAN package for GSA, and the respective Bioconductor packages for GLOBALTEST, SAFE, ROAST, CAMERA, PADOG, and GSVA.
ORA: Overrepresentation Analysis. Tests the overlap between the differentially expressed genes and genes in a gene set based on the hypergeometric distribution. Implementation: function phyper from the stats package. See Supplementary Discussion S2.1 for details.
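The hypergeometric tail probability behind ORA can be illustrated with a short language-agnostic sketch (here in Python rather than the phyper call actually used; all names are ours):

```python
from math import comb

def ora_pvalue(k, n, K, N):
    """Hypergeometric upper-tail probability P(X >= k): the chance of
    observing at least k DE genes in a gene set of size K, given n DE
    genes among N measured genes."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# a larger overlap with the gene set yields a smaller (stronger) p-value
p3 = ora_pvalue(3, n=6, K=5, N=20)  # 3 of 6 DE genes fall into a 5-gene set
p2 = ora_pvalue(2, n=6, K=5, N=20)
```

The sketch makes explicit that only four numbers enter the test: the overlap, the number of DE genes, the gene set size, and the size of the background (see Supplementary Discussion S2.2.2 on the choice of background).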
GLOBALTEST: Global testing of groups of genes. Implements a self-contained test of groups of genes for association with a response variable. Implemented in the globaltest package [2]. We executed GLOBALTEST (globaltest::gt) providing the sample group vector (argument response), the expression dataset (alternative), the gene sets (subsets), and the number of permutations (permutations). Defaults were used for all other arguments. From the resulting gt.object we extracted the p-value for each gene set using the p.value accessor method.
GSEA: Gene Set Enrichment Analysis. Uses a Kolmogorov-Smirnov (KS) statistic to test whether the ranks of the p-values of genes in a gene set resemble a uniform distribution. Computation of the cumulative KS statistic can be time-consuming. Implementation: R script [3]. GSEA (function GSEA of the script) was executed providing the expression matrix (argument input.ds), the sample group vector (input.cls), the gene sets in GMT format (gs.db), and the number of permutations (nperm). Defaults were used for all other arguments. From the two tab-separated global result text files (one for gene sets with positive enrichment score, one for gene sets with negative enrichment score), we extracted the p-value for each gene set from the NOM p-val column.
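The running-sum statistic underlying GSEA can be sketched as follows; this is a simplified, unweighted KS-style version for illustration, not the exact implementation of the R script (function and variable names are ours):

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-style running sum: step up for gene-set members,
    step down for non-members; ES = maximum deviation from zero."""
    hits = set(gene_set) & set(ranked_genes)
    n, nh = len(ranked_genes), len(hits)
    up, down = 1.0 / nh, 1.0 / (n - nh)
    running, es = 0.0, 0.0
    for g in ranked_genes:
        running += up if g in hits else -down
        if abs(running) > abs(es):
            es = running
    return es

ranking = [f"g{i}" for i in range(1, 11)]
es_top = enrichment_score(ranking, ["g1", "g2", "g3"])    # set concentrated at top
es_spread = enrichment_score(ranking, ["g1", "g5", "g10"])  # set spread out
```

A gene set whose members cluster at the top of the ranking attains a higher score than one whose members are spread evenly, which is the intuition behind the test for deviation from a uniform rank distribution.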
SAFE: Significance Analysis of Function and Expression. Implements a general permutation framework in which values of a local (per-gene) test statistic are combined into a global (gene set) test statistic. Uses Wilcoxon's rank sum as the default global statistic. Implemented in the safe package [4]. SAFE (safe::safe) was executed providing the expression matrix (argument X.mat), the sample group vector (y.vec), the gene × gene set incidence matrix (C.mat), and the number of permutations (Pi.mat). Defaults were used for all other arguments, resulting in SAFE using Student's t as the gene-level statistic (local="t.Student") and Wilcoxon's rank sum as the gene set statistic (global="Wilcoxon"). From the resulting object we extracted the permutation p-value for each gene set from the global.pval slot.
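The two-level local/global permutation scheme can be sketched generically in Python; this is a strongly simplified illustration (mean difference as local statistic, a rank sum as global statistic), not SAFE's actual implementation, and all names are ours:

```python
import random
from statistics import mean

def local_stats(expr, labels):
    """Per-gene mean difference between the two sample groups."""
    return [mean(x for x, l in zip(row, labels) if l == 1) -
            mean(x for x, l in zip(row, labels) if l == 0)
            for row in expr]

def global_stat(stats, set_idx):
    """Wilcoxon-style rank sum of the gene-set genes' local statistics."""
    order = sorted(range(len(stats)), key=lambda i: stats[i])
    rank = {i: r + 1 for r, i in enumerate(order)}
    return sum(rank[i] for i in set_idx)

def safe_like_pvalue(expr, labels, set_idx, nperm=200, seed=0):
    """Calibrate the global statistic by permuting sample labels."""
    rng = random.Random(seed)
    obs = global_stat(local_stats(expr, labels), set_idx)
    hits = 0
    for _ in range(nperm):
        perm = labels[:]
        rng.shuffle(perm)
        if global_stat(local_stats(expr, perm), set_idx) >= obs:
            hits += 1
    return (hits + 1) / (nperm + 1)

# toy data: 6 genes x 6 samples, genes 0 and 1 up-regulated in group 1
labels = [0, 0, 0, 1, 1, 1]
expr = [[0, 0, 0, 5, 5, 5],
        [0, 0, 0, 4, 4, 4],
        [0.1, 0.2, 0.3, 0.1, 0.2, 0.3],
        [0.3, 0.1, 0.2, 0.2, 0.3, 0.1],
        [0.2, 0.3, 0.1, 0.3, 0.1, 0.2],
        [0.1, 0.1, 0.2, 0.2, 0.1, 0.1]]
p = safe_like_pvalue(expr, labels, set_idx=[0, 1])
```

Swapping the local or global function corresponds to SAFE's local= and global= arguments.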
GSA: Gene Set Analysis. Differs from GSEA by using the maxmean statistic, i.e. the maximum mean of either the positive or negative part of gene scores in the gene set. Uses a combination of sample permutation and gene permutation, which can result in longer runtimes. Implemented in the GSA package [5]. We executed GSA (GSA::GSA) providing the expression matrix (x), the sample group vector (y), the list of gene sets (genesets), and the number of permutations (nperms). Defaults were used for all other arguments, resulting in GSA using the maxmean statistic as method for computing the gene set statistic (method="maxmean"). From the result list we extracted the p-values for negative gene sets (pvalues.lo) and the p-values for positive gene sets (pvalues.hi), and calculated a single combined p-value for each gene set by taking the minimum p-value of the two directional p-values (positive / negative) multiplied by two.
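The maxmean statistic and the combination of the two directional p-values can be written out in a few lines; this is an illustrative sketch (all names are ours), with the combined value capped at 1 for safety:

```python
def maxmean(scores):
    """GSA's maxmean statistic: the larger (in magnitude) of the averaged
    positive parts and the averaged negative parts of the gene scores."""
    n = len(scores)
    pos = sum(s for s in scores if s > 0) / n
    neg = sum(-s for s in scores if s < 0) / n
    return pos if pos >= neg else -neg

def combine_directional(p_lo, p_hi):
    """Single p-value from GSA's directional p-values: twice the smaller
    of the two, capped at 1."""
    return min(1.0, 2 * min(p_lo, p_hi))
```

For example, maxmean([1, 1, 1, -0.5]) evaluates the positive part (0.75) against the negative part (0.125) and keeps the larger one with its sign.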
SAMGS: Significance Analysis of Microarrays on Gene Sets. Implements a self-contained test that extends the SAM method [6] for single genes to gene set analysis. Implementation: R script [7]. SAMGS (function SAMGS of the script) was executed providing the expression matrix (argument DATA), the sample group vector (cl), the gene × gene set incidence matrix (GS), and the number of permutations (nbPermutations). From the result we extracted the p-value for each gene set from the GS p-value column.
ROAST: ROtAtion gene Set Test. Implements a self-contained test that uses rotation instead of permutation for assessment of gene set significance. Implemented in the limma [8] and edgeR [9] packages for microarray and RNA-seq data, respectively. We executed ROAST (limma::mroast) providing the expression matrix (argument y), the gene set index list (index), the experimental design matrix (design), and the number of rotations (nrot). For raw RNA-seq read counts, the expression matrix was provided as a DGEList with size factors and dispersions calculated. All other arguments were left at their default values. From the result list we extracted the column PValue, containing the two-sided directional p-value for each gene set, which was used to generate the presented results.
CAMERA: Correlation Adjusted MEan RAnk gene set test. Implements a competitive test that accounts for inter-gene correlations. Implemented in the limma [8] and edgeR [9] packages for microarray and RNA-seq data, respectively. We executed CAMERA (limma::camera) providing the expression matrix (argument y), the gene set index list (index), and the experimental design matrix (design). For raw RNA-seq read counts, the expression matrix was provided as a DGEList with size factors and dispersions calculated. All other arguments were left at their default values. From the result list we extracted the column PValue, containing the two-tailed p-value for each gene set, which was used to generate the presented results.
PADOG: Pathway Analysis with Down-weighting of Overlapping Genes. PADOG computes the mean of the absolute moderated t-scores of the genes in a gene set, but additionally incorporates gene weights to favor genes appearing in few pathways versus genes that appear in many pathways. Implemented in the PADOG package [10]. PADOG (PADOG::padog) was executed providing the expression matrix (argument esetm), the sample group vector (group), the list of gene sets (gslist), and the number of permutations (NI). From the resulting data.frame we extracted the column Ppadog, containing the p-value for each gene set, which was used for all further analysis.
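The down-weighting idea can be sketched as follows; this is a schematic re-implementation of the principle (weights scaled between 1 and 2, decreasing with how often a gene appears across gene sets), not necessarily PADOG's exact code, and all names are ours:

```python
from math import sqrt

def gene_weights(freq):
    """Down-weight genes that appear in many gene sets: weights range
    from 1 (most frequent gene) to 2 (least frequent gene)."""
    fmin, fmax = min(freq.values()), max(freq.values())
    span = (fmax - fmin) or 1
    return {g: 1 + sqrt((fmax - f) / span) for g, f in freq.items()}

def weighted_abs_mean(tscores, weights):
    """PADOG-style set statistic: weighted mean of absolute t-scores."""
    total = sum(weights[g] for g in tscores)
    return sum(abs(t) * weights[g] for g, t in tscores.items()) / total

freq = {"A": 1, "B": 10, "C": 5}  # times each gene occurs across all sets
w = gene_weights(freq)            # gene A (set-specific) weighs twice gene B
```

Genes specific to few pathways thus contribute more to the set statistic than widely shared "hub" genes.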
GSVA: Gene Set Variation Analysis. Transforms the data from a gene by sample matrix to a gene set by sample matrix, thereby allowing the evaluation of gene set enrichment for each sample. Implemented in the GSVA package [11]. We executed GSVA (GSVA::gsva) providing the expression matrix (argument expr), the list of gene sets (gset.idx.list), and kcdf="Gaussian" for continuous expression values (microarray log-intensities, RNA-seq log-CPMs or log-TPMs) or kcdf="Poisson" for raw RNA-seq read counts. All other arguments were left at their default values. Accordingly, GSVA was carried out with mx.diff=TRUE (default), calculating the enrichment score as the difference between the maximum positive and negative deviations from zero of the random walk statistic. To analyze differences in the enrichment scores between sample groups, we applied limma as suggested in the vignette of the GSVA package.

S1.2 Enrichment tools
The goal of our benchmark study is a quantitative assessment of the performance of EA methods/algorithms, as opposed to a comparison of EA tools, which typically facilitate the execution of one or more EA methods on a number of existing gene set databases and offer different options for result exploration and visualization. Popular enrichment tools listed in Table 2 of the main manuscript typically work on a list of genes provided by the user, and use a version of ORA and/or GSEA to quantify enrichment of genes annotated to specific molecular functions or biological processes as defined e.g. in the Gene Ontology or the KEGG pathway database. In the following, we describe which methods each tool implements. Please refer to the references provided in Table 2 of the main manuscript for more information.
DAVID: uses ORA. Tests the overlap between the input gene list and genes in a gene set based on the hypergeometric distribution. The statistical test is identical to our implementation of ORA, but it is often used with a much larger background set (the whole genome), which typically results in less conservative p-values.
GOSTATS: also provides two modes of hypergeometric testing: (1) standard ORA, but using a nonspecific filtering strategy to arrive at a more realistic background set (similar to the strategy described in Supplementary Discussion S2.2.2); (2) a conditional ORA that also takes into account the parent-child relationships between terms of the Gene Ontology.
WEBGESTALT: provides (1) standard ORA, identical to the one used by DAVID. (2) pre-ranked GSEA, which works on a ranked gene list instead of the full expression matrix [12]. (3) A network-based approach that performs random walk-based network propagation on an input gene list.
PANTHER: provides (1) standard ORA, identical to the one used by DAVID. (2) an enrichment test that is based on Wilcoxon's rank sum test, which is also used in SAFE per default.
ENRICHR: provides (1) standard ORA, identical to the one used by DAVID; (2) an empirical z-score method that compares the observed rank for a gene set to the expected rank derived from performing ORA on many random input gene lists; (3) a combined score that multiplies the log of the ORA p-value from (1) with the z-score from (2).
TOPPGENE: uses standard ORA, identical to the one used by DAVID.
G:PROFILER: similar to GORILLA in providing (1) standard ORA, and (2) application of ORA to all possible partitions of a ranked input list of genes and identification of the partition with minimum p-value.
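ENRICHR's combined score, as described above, can be sketched in one line; note that sign conventions for the z-score vary, so this is an illustration of the arithmetic only (the function name is ours):

```python
from math import log

def enrichr_combined_score(p_value, z_score):
    """Combined score c = log(p) * z: a more significant ORA p-value and
    a stronger deviation from the expected rank both increase |c|."""
    return log(p_value) * z_score

# e.g. p = 0.01 and z = -2 yield log(0.01) * (-2) ~ 9.21
c = enrichr_combined_score(0.01, -2.0)
```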
GENETRAIL: supports enrichment analysis for (optionally ranked) input gene lists as the other tools above, but also implements methods for expression-based enrichment analysis when provided the full expression matrix (genes × samples). For simple gene lists, it provides standard ORA. For ranked gene lists that are optionally accompanied by scores, it additionally provides pre-ranked GSEA and a number of additional set-level statistics. When provided a full expression matrix (such as an expression dataset from GEO), it allows, similar to SAFE, choosing from a number of local (gene-level) and global (set-level) statistics.

S1.3 Construction of gene set relevance rankings from MalaCards
As illustrated in Figure 1 of the main manuscript, we constructed gene set relevance rankings from MalaCards [13] in two steps. Focusing on the diseases investigated in the datasets of the benchmark compendia, we (1) systematically extracted relevant genes for each disease directly from the respective MalaCards page, and (2) subjected the disease genes for each disease to GeneAnalytics [14] to obtain the gene set relevance rankings for GO-BP terms and KEGG pathways. In the following, we describe both steps in more detail. It is important to note that both steps of the curation process exclude any meaningful impact of (i) the expression datasets of our benchmark compendium, and (ii) published gene set enrichment analyses of transcriptomic studies investigating diseases represented in our benchmark compendium.
1. Disease genes: According to https://www.malacards.org/pages/info#related_genes, the curation process of disease genes takes into account:
• GeneCards [15] textual association,
• genetic testing resources supplying specific genetic tests for the disease,
• genetic variation resources supplying specific causative variations in genes for the disease, and
• resources that manually curate the association of the disease with genes.
The sources for these evidence categories are listed (see Table 3). According to https://www.malacards.org/pages/info#scores, these evidence categories are combined into a composite score consisting of co-occurrences of gene and disease in GeneCards summaries, as well as the targeted experiments listed above, which do not include (1) high-throughput expression assays or (2) gene set enrichment analyses. As an example, see the Genes for Breast Cancer section of https://www.malacards.org/card/breast_cancer#related_genes. For each gene, it lists up to nine evidence categories including Molecular basis known, Pathogenic, Causative variation, Genetic Tests, and Susceptibility factor.
One category is GeneCards inferred, of which one sub-category is Publications. While this last category could include co-occurring text from the publications of our benchmark datasets, we note that our cancer vs. normal contrasts each generated hundreds to thousands of differentially expressed genes. Even if a few of these genes were mentioned in the publications of the benchmark datasets, the impact on the MalaCards relevance score would be negligible.

Gene set relevance rankings:
Given the list y of relevant genes for a disease D, GeneAnalytics [14] computes a composite relevance score S(x) for a gene set x of interest (here: a GO-BP term or a KEGG pathway) via

S(x) = 1/R(x) · ∏_{g ∈ z} S_D(g) · N_S,

where R(x) is the rank of gene set x when ordering gene sets first by their SetDistiller [16] p-value, and then by gene set size. S_D(g) is the disease relevance score of a gene g as derived in Step 1 above, and is taken into account for all genes in z = x ∩ y, i.e. for genes that are shared between gene set x and the list y of relevant genes for disease D. N_S is the number of data sources supporting the disease relevance of genes in the gene set.
We observe that the gene set relevance score has three main components:
1. 1/R(x) is based on the SetDistiller [16] p-value, which is calculated from the binomial distribution, testing the null hypothesis that the frequency of gene set x in the query set y is not significantly different from what is expected under random sampling of genes, given the frequency of x in the set of all genes;
2. ∏_{g∈z} S_D(g) summarizes the per-gene disease relevance for genes shared between the gene set x and the disease gene set y; and
3. N_S is the number of sources listed on a GeneCards page that support a link between genes in gene set x and disease D (see evidence categories and sources listed above).
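Put schematically, the composition of these three components can be sketched as follows; the exact composition is internal to GeneAnalytics, so this is strictly our illustrative reading of the description above, with hypothetical inputs:

```python
def relevance_score(rank, gene_scores, n_sources):
    """Schematic combination of the three components described above:
    a rank-based factor 1/R(x), the product of per-gene disease
    relevance scores S_D(g) for shared genes z = x n y, and the number
    of supporting sources N_S. Illustration only, not GeneAnalytics."""
    prod = 1.0
    for s in gene_scores:   # S_D(g) for each g in z
        prod *= s
    return (1.0 / rank) * prod * n_sources

# hypothetical gene set ranked 2nd, sharing two disease genes, 3 sources
s = relevance_score(rank=2, gene_scores=[5, 10], n_sources=3)
```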
We note that the SetDistiller test comes methodologically close to a hypergeometric ORA, but it is carried out on the list of disease genes (as opposed to the list of DE genes for each dataset of the benchmark compendium). Further, it is used here as an additional weighting factor, where the main contribution to the gene set relevance score comes from the product of the per-gene disease relevance scores.

S1.4 Comparison of the relevance score to alternative measures

S1.4.1 Motivation and properties of the relevance score X_m(d)
In the manuscript we motivate the score as a measure of phenotype relevance accumulated along a gene set ranking. The score X_m(d) for method m applied to dataset d therefore sums up the individual gene set relevance scores, weighted by the relative position of each gene set in the ranking of method m. Gene sets that are ranked towards the top obtain a high weight, and vice versa. The goal of the score is to generalize the approach we take when manually inspecting a gene set ranking for phenotype relevance: we check whether top ranked gene sets ("positives") have relevance for the phenotype as reported in the literature ("true positives"). In general, we are thus interested in detecting a gene set if it is relevant. Researchers rarely inspect the bottom of the ranking / insignificant gene sets ("negatives"), and check whether these gene sets indeed have no relevance for the phenotype ("true negatives"). True negatives are also more difficult to establish, as it can rarely be definitively determined whether a gene set is indeed irrelevant for a phenotype, or whether the gene set's relevance has simply not been studied or established yet (investigation bias / observational bias). For additional properties and limitations of the relevance score X_m(d) with respect to comparability between datasets and the presence of ties, we refer to Methods, Section 2.6 Phenotype relevance.
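The position-dependent weighting can be sketched as follows; the exact weight definition is given in Methods Section 2.6, and here we simply assume linearly decreasing weights by relative rank position for illustration (all names are ours):

```python
def weighted_sum_relevance(ea_ranking, rel_scores):
    """Accumulate relevance along an EA ranking: gene sets ranked towards
    the top receive high weights, towards the bottom low weights.
    Illustrative linear weights; see Methods 2.6 for the exact scheme."""
    n = len(ea_ranking)
    return sum(((n - i) / n) * rel_scores.get(gs, 0.0)
               for i, gs in enumerate(ea_ranking))

rel = {"a": 10.0, "b": 5.0}  # hypothetical relevance scores
top = weighted_sum_relevance(["a", "b", "c", "d"], rel)     # relevant sets on top
bottom = weighted_sum_relevance(["d", "c", "b", "a"], rel)  # relevant sets at bottom
```

An EA ranking that places the relevant gene sets on top thus accumulates a higher score than one that buries them, without requiring any significance cutoff.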

S1.4.2 Alternative measures (true negative rate, ROC/AUC, cor, ...)
It is instructive to inspect other measures. The evalRelevance function therefore accepts an argument method that determines how the relevance score is summarized across the enrichment analysis ranking (see the documentation of the function in the reference manual of the package). Choices for the method argument include:
• "wsum" to compute a weighted sum of the relevance scores (default, corresponds to X_m(d));
• "auc" to perform a ROC/AUC analysis;
• "cor" to compute a correlation;
• or a user-defined function for customized behavior.
Instead of "auc", this can also be any other performance measure that the ROCR package implements, for example "tnr" for calculation of the true negative rate.
However, the following considerations apply:
- ROC / AUC / true negative rate: It is tempting to treat the comparison of the EA ranking and the MalaCards ranking as a classification problem. However, this requires dividing (i) the EA ranking into enriched (positive) and not enriched (negative) gene sets, and (ii) the MalaCards ranking into relevant and irrelevant gene sets. Neither step is straightforward, and both require the definition of thresholds. For (i), universal thresholding on the gene set p-value seems to lead to either overly small or overly large proportions of enriched gene sets, depending on the method (Figure 3). And for (ii), the MalaCards rankings seem to provide varying degrees of relevance for a subset of gene sets, as opposed to a binary categorization of all gene sets as either relevant or irrelevant. Also, given the above considerations on the difficulty of establishing real true negatives, it is our understanding that absence from a MalaCards relevance ranking does not imply irrelevance for the phenotype per se. As a compromise, we consider a defined number of gene sets at the top of the EA ranking as enriched and, analogously, a defined number of gene sets at the top of the MalaCards ranking as relevant.
Let us consider the top 10 gene sets of the EA ranking as enriched and the top 10 gene sets of the MalaCards ranking as relevant. An optimal AUC of 1 is then achieved if all 10 relevant gene sets are placed (in arbitrary order) among the top 10 enriched gene sets. Note that such an analysis therefore cannot account for varying degrees of relevance among the 10 relevant gene sets: an optimal EA ranking placing the 10 relevant gene sets in the order of the relevance ranking achieves an AUC of 1; but a suboptimal EA ranking that places the 10 relevant gene sets in reverse order at the top also achieves an AUC of 1.
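This AUC behavior can be verified with a small sketch (Mann-Whitney formulation of the AUC; all names are ours):

```python
def auc_top_k(ea_ranking, relevant):
    """AUC of an EA ranking against a binary relevant/irrelevant split:
    the fraction of (relevant, irrelevant) pairs where the relevant
    gene set is ranked higher (Mann-Whitney formulation)."""
    pos = [ea_ranking.index(g) for g in relevant]
    neg = [i for i, g in enumerate(ea_ranking) if g not in relevant]
    pairs = sum(1 for p in pos for q in neg if p < q)
    return pairs / (len(pos) * len(neg))

sets = [f"gs{i}" for i in range(20)]
relevant = sets[:10]
in_order = sets                          # 10 relevant sets on top, in order
reversed_top = sets[9::-1] + sets[10:]   # same 10 on top, in reverse order
```

Both rankings place all relevant gene sets among the positives, so both attain an AUC of 1 regardless of the internal order, which is exactly the insensitivity described above.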
    # setup
    > library(GSEABenchmarkeR)

This illustrates two beneficial aspects of the weighted sum X_m(d): (1) it avoids any artificial thresholding on the EA ranking by calculating weights that express whether gene sets are ranked towards the top or the bottom of the ranking, and (2) it accounts for varying degrees of relevance in the MalaCards ranking. It thereby faithfully distinguishes between EA rankings that accumulate high phenotype relevance, but each to a slightly different extent.

    # optimal EA ranking
    # (10 relevant pathways at the top, in the order of the relevance ranking)
    > evalRelevance(ea.ranks, rel.ranks, method="wsum")
    [1] 908.1592

    # suboptimal EA ranking
    # (10 relevant pathways at the top, but in reverse order)
    > evalRelevance(ea.ranks.rev, rel.ranks, method="wsum")
    [1] 905.8367

- Correlation: The main argument against using a standard correlation measure comes from the missingness in the relevance rankings (Supplementary Figure S12a and c). Thus, for a dataset and associated phenotype, we compare an EA ranking going over the full gene set vector (N ≈ 300 for KEGG; N ≈ 4,600 for GO-BP) against the typically much smaller vector of gene sets for which a relevance score is annotated. For this scenario, using rank correlation reduces the question to "does a subset of the EA ranking preserve the order of the relevance ranking?"; whereas our question of interest is rather "is a subset of the relevant gene sets ranked highly in the EA ranking?".
Consider the case of a relevance ranking containing 10 relevant gene sets against an EA ranking of 323 gene sets. Rank correlation (using stats::cor with use="pairwise.complete.obs" and method="spearman") is then computed between the relevance ranking and the EA ranking restricted to the 10 relevant gene sets. This accordingly results in very different correlations for (i) an EA ranking that places the 10 relevant gene sets at the top (in the order of the relevance ranking), and (ii) an EA ranking that also places the 10 relevant gene sets at the top, but in reverse order.
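The effect of restricting the correlation to the annotated subset can be reproduced with a small sketch (Spearman correlation computed as Pearson on ranks, no ties; all names are ours):

```python
def spearman(xs, ys):
    """Spearman rank correlation via Pearson correlation on ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

rel_positions = list(range(1, 11))  # relevance ranking of the 10 relevant sets
ea_in_order = list(range(1, 11))    # EA ranks when placed on top, in order
ea_reversed = list(range(10, 0, -1))  # EA ranks when placed on top, reversed
```

Although both EA rankings place all 10 relevant gene sets at the very top, the restricted correlation swings from +1 to -1, illustrating why it answers the wrong question here.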
    # continuing with the relevance ranking for Alzheimer's disease from above, ...
    # ... but restricting the relevance ranking to 10 pathways
    > rel.ranks <- rel.ranks[1:10, ]

Here, the weighted sum X_m(d) again has preferable properties, as it recognizes that both EA rankings accumulate high phenotype relevance at the top. Yet, as desired, a slightly higher score is obtained for the optimal ranking.

    # Score of the optimal EA ranking
    # (10 relevant pathways at the top, in the order of the relevance ranking)
    > evalRelevance(ea.ranks, rel.ranks, method="wsum")
    [1] 539.7688

    # Score of the suboptimal EA ranking
    # (10 relevant pathways at the top, but in reverse order)
    > evalRelevance(ea.ranks.rev, rel.ranks, method="wsum")
    [1] 537.4463

S1.5 Executable benchmark system
The GSEABenchmarkeR package is implemented in R [17] and is available from Bioconductor [18] at http://bioconductor.org/packages/GSEABenchmarkeR. The package allows users to (i) load specific pre-defined and user-defined data compendia, (ii) carry out DE analysis across datasets, (iii) apply EA methods to multiple datasets, and (iv) benchmark results with respect to the chosen criteria.
Loading of benchmark compendia is facilitated through the loadEData function, which simplifies access to (i) the pre-defined GEO2KEGG microarray compendium, (ii) the pre-defined TCGA RNA-seq compendium, and (iii) user-defined data from file. Datasets of the GEO2KEGG microarray compendium are loaded from the Bioconductor packages KEGGdzPathwaysGEO and KEGGandMetacoreDzPathwaysGEO [19,20]. Probe-to-gene mapping for each dataset can optionally be carried out, in order to summarize expression levels for probes annotated to the same gene. Datasets of the TCGA RNA-seq compendium are loaded using the curatedTCGAData package [21] (TPMs) or from GSE62944 [22,23] (raw read counts). User-defined data is also accepted, requiring a file path to the directory where datasets have been saved as serialized R data files (Supplementary Methods S1.8).
Caching to flexibly save and restore an already processed expression data compendium is incorporated by building on functionality of the BiocFileCache package [24]. This is particularly beneficial as preparing an expression data compendium for benchmarking of EA methods can be time-consuming and can involve several pre-processing steps.
DE analysis between sample groups for selected datasets of a compendium can be carried out using the function runDE. The function invokes deAna on each dataset, which contrasts the sample groups depending on data type and user choice via limma/voom, edgeR, or DESeq2.
Enrichment analysis At the core of applying a specific EA method to a single dataset is the runEA function, which delegates execution of the chosen method to either sbea (set-based enrichment analysis) or nbea (network-based enrichment analysis). Both functions also accept user-defined enrichment methods (Supplementary Methods S1.7, [1]). In addition, runEA returns CPU time used and allows saving results for subsequent assessment.
Parallel computation of functions for microarray preprocessing, DE analysis, and enrichment analysis when applied to multiple datasets is realized by building on infrastructure implemented in the BiocParallel package [25]. Internally, these functions call bplapply, which per default triggers parallel computation as configured in BiocParallel's registry of computation parameters. As a result, parallel computation is implicitly incorporated when calling these functions on a multi-core machine. To change the execution mode of these functions, accordingly configured computation parameters can either directly be registered, or supplied as an argument to the respective function. Distributed computation on an institutional computer cluster or a computing cloud is straightforward by similarly configuring a computation parameter of class BatchtoolsParam for that purpose.
Benchmarking Once methods have been applied to a chosen benchmark compendium, they can be subjected to a comparative assessment using dedicated functions for loading, evaluation, and visualization of the results. The function evalNrSigSets evaluates the fraction of significant gene sets given a significance level alpha and a method for multiple testing correction, which can be chosen from the methods implemented in p.adjust from the stats package. The function evalRelevance evaluates phenotype relevance between EA rankings and corresponding relevance rankings, given a mapping from dataset to phenotype investigated. Integrated relevance rankings can be refined and relevance rankings for additional datasets can also be incorporated (Supplementary Methods S1.9). Detailed documentation of all implemented functions is available in the reference manual of the package.

S1.6 Benchmarking network-based methods
Benchmarking with the GSEABenchmarkeR package extends to network-based methods that incorporate known gene regulatory interactions. For demonstration, we execute two network-based methods (SPIA [26] and GGEA [27]) on three datasets of the GEO2KEGG microarray compendium, and compare their runtimes on these datasets.

    # setup
    > library(GSEABenchmarkeR)
    > library(EnrichmentBrowser)

    # prepare
    > geo2kegg <- loadEData("geo2kegg", nr.datasets=3, cache=FALSE)
    > geo2kegg <- maPreproc(geo2kegg)
    > geo2kegg <- runDE(geo2kegg)

    # get KEGG gene sets
    > kegg.gs <- getGenesets(org="hsa", db="kegg")

    # compile a gene regulatory network from KEGG
    > kegg.grn <- compileGRN(org="hsa", db="kegg")

    # execute SPIA and GGEA on the three datasets
    > res <- runEA(geo2kegg,
                   methods=c("spia", "ggea"),

Listing 1: Benchmarking network-based methods

S1.7 Benchmarking user-defined methods
User-defined enrichment methods can easily be plugged into the benchmarking framework. For demonstration, we define a dummy enrichment method that randomly draws p-values from a uniform distribution. We then execute this method on datasets of the GEO2KEGG compendium and inspect the percentage of significant gene sets returned for each dataset.

    ...

    # get the rankings
    > ranks <- readResults(data.dir="~/method_bench", data.ids=names(geo2kegg),
                           methods="method", type="ranking")

    # evaluate the percentage of significant gene sets
    > sig.sets <- evalSigSets(ranks, padj="none", alpha=0.05)

Listing 2: Benchmarking user-defined methods

S1.8 Incorporating user-defined benchmark compendia
The benchmarking can be straightforwardly extended to additional datasets. The loadEData function accepts a directory where datasets of class SummarizedExperiment [28] are stored as RDS files [29].

S1.9 Incorporating user-defined relevance rankings
It is also possible to refine the integrated MalaCards relevance rankings or to incorporate relevance rankings for additional datasets. For demonstration, we define an exemplary relevance ranking for 10 gene sets, and evaluate the relevance accumulated by an exemplary EA ranking.
    # (1) producing an EA ranking
    > ea.ranks <- makeExampleData("ea.res")
    > ea.ranks <- gsRanking(ea.ranks, signif.only=FALSE)
    > ea.ranks
    DataFrame with 10 rows and 2 columns
    ...

    # (2) defining a relevance score ranking
    > rel.ranks <- ea.ranks
    > rel.ranks[,2] <- round(runif(nrow(ea.ranks), min=1, max=100))
    > colnames(rel.ranks)[2] <- "REL.SCORE"
    > rownames(rel.ranks) <- rel.ranks[,"GENE.SET"]
    > ind <- order(rel.ranks[,"REL.SCORE"], decreasing=TRUE)
    > rel.ranks <- rel.ranks[ind, ]
    > rel.ranks
    DataFrame with 10 rows and 2 columns
    ...

Listing 4: Incorporating user-defined relevance rankings

S2.1 Self-contained vs. competitive
It is not always trivial to categorize methods as either competitive or self-contained, and several methods combine aspects of both models. For example, GSEA and SAFE are hybrid in the sense that they motivate their test statistic on the basis of a competitive gene-sampling model, but calculate their p-value in a self-contained subject-sampling manner [30]. This similarly applies to GSA, which computes a self-contained test statistic and calculates the p-value in a self-contained subject-sampling manner, but uses a competitive gene-sampling procedure for restandardization of the observed and permuted values of the test statistic, making GSA effectively competitive. PADOG also uses sample permutation and restandardization via gene permutation, but computes a mean of the absolute gene scores weighted by the occurrence frequency of genes across all gene sets tested. The classification of GSEA further depends on the execution mode: for small sample sizes, GSEA provides an argument to use gene permutation for the p-value calculation, making it fully competitive. When using sample permutation, GSEA can also be executed fully self-contained if the Kolmogorov-Smirnov statistic is calculated on the basis of the DE p-values for each gene in the gene set, instead of on their ranks [30]. Another interesting example is GSVA, which belongs to the class of single-sample EA methods and thus takes a conceptually different approach than all other methods assessed in this paper. The other methods (1) analyze differential expression of individual genes between sample groups, and (2) summarize DE of individual genes across the gene set under investigation. Applied in a comparison of sample groups, GSVA reverses the typical approach by (1) computing gene set enrichment scores for each sample, and (2) testing for differential "expression" of these enrichment scores between sample groups using e.g. limma.
While the second step is a self-contained significance assessment of the enrichment scores for each gene set, GSVA computes the enrichment score for each sample, like GSEA, based on the KS statistic in an "unsupervised" competitive way, i.e. without taking the sample classification into account [31]. Interestingly, we observed this approach to be effectively self-contained, closely resembling the results obtained for ROAST, a fully self-contained method. We further note that such distinctions are not necessary for the competitive methods ORA and CAMERA, and the self-contained methods GLOBALTEST, SAMGS, and ROAST.
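To make the reversal of steps concrete, the following self-contained Python sketch mimics the single-sample approach on toy data. It is illustrative only: a simple mean-based set score stands in for GSVA's KS-based sample-wise score, and a plain pooled-variance t statistic stands in for the limma moderated t; all names and values are ours, not part of the benchmark code.

```python
import statistics

def set_score(sample_expr, gene_set):
    # per-sample enrichment score: mean expression of the set genes in that
    # sample (a simplified stand-in for GSVA's KS-based sample-wise score)
    return statistics.mean(sample_expr[g] for g in gene_set)

def t_statistic(xs, ys):
    # plain pooled-variance two-sample t statistic, standing in for the
    # moderated t test that GSVA applies to the per-sample scores via limma
    nx, ny = len(xs), len(ys)
    pooled = ((nx - 1) * statistics.variance(xs) +
              (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
    se = (pooled * (1 / nx + 1 / ny)) ** 0.5
    return (statistics.mean(xs) - statistics.mean(ys)) / se

# toy log-expression data: 3 cases, 3 controls, 3 genes; g1/g2 form the set
expr = {
    "case1": {"g1": 2.1, "g2": 1.9, "g3": 0.1},
    "case2": {"g1": 1.8, "g2": 2.2, "g3": -0.2},
    "case3": {"g1": 2.0, "g2": 1.7, "g3": 0.0},
    "ctrl1": {"g1": 0.2, "g2": -0.1, "g3": 0.1},
    "ctrl2": {"g1": -0.3, "g2": 0.0, "g3": -0.1},
    "ctrl3": {"g1": 0.1, "g2": 0.2, "g3": 0.2},
}
gene_set = ["g1", "g2"]

# step (1): one enrichment score per sample
scores = {s: set_score(e, gene_set) for s, e in expr.items()}
# step (2): test the scores, not the genes, between the sample groups
cases = [scores[s] for s in ("case1", "case2", "case3")]
controls = [scores[s] for s in ("ctrl1", "ctrl2", "ctrl3")]
t = t_statistic(cases, controls)
```

Because the set genes are shifted upwards in the cases, the per-sample scores separate the two groups and the t statistic is large and positive, illustrating how the group comparison happens entirely at the level of the enrichment scores.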

S2.2.1 Choosing the DE genes
DE studies typically report a gene as differentially expressed if the corresponding DE p-value, corrected for multiple testing, satisfies the chosen significance level. EA methods that work directly on the list of DE genes are then substantially influenced not only by the DE method, but also by the method for multiple testing correction. ORA is inapplicable if only few genes satisfy the significance threshold, or if almost all genes are DE. We therefore implemented a flexible, context-dependent adjustment procedure that accounts for such cases by applying multiple testing correction depending on the overall DE level in the dataset:
• the correction method of Benjamini and Hochberg (BH) is applied if it renders ≥ 1% and ≤ 25% of all measured genes DE,
• the p-values are left unadjusted if the BH correction results in < 1% DE genes, and
• the more stringent Bonferroni correction is applied if the BH correction results in > 25% DE genes.
Note that the resulting p-values are not further used for assessing the statistical significance of DE genes within or between datasets. They are solely used to determine which genes are included in the analysis with ORA, where the context-dependent correction ensures that the fraction of included genes is roughly of the same order of magnitude across datasets.
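The rule above can be sketched in a few lines. The following Python code is illustrative only (the benchmark applies the rule via the EnrichmentBrowser implementation in R); the function names are ours, and the 1%/25% cut-offs are as stated above.

```python
def bh_adjust(pvals):
    # Benjamini-Hochberg step-up adjustment: p * n / rank, with a running
    # minimum taken from the largest rank downwards
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adj = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adj[i] = running_min
    return adj

def context_dependent_adjust(pvals, alpha=0.05):
    # BH if it yields 1-25% DE genes; raw p-values if BH yields < 1%;
    # Bonferroni if BH yields > 25%
    n = len(pvals)
    bh = bh_adjust(pvals)
    frac_de = sum(p <= alpha for p in bh) / n
    if frac_de < 0.01:
        return pvals                              # leave unadjusted
    if frac_de > 0.25:
        return [min(1.0, p * n) for p in pvals]   # Bonferroni
    return bh

# example: 10 strong signals among 100 genes -> BH renders 10% DE, so BH is kept
adj = context_dependent_adjust([1e-6] * 10 + [0.5] * 90)
```

In the example, the 10 small p-values are adjusted to 1e-5 and remain significant, while a vector of 100 uniformly large p-values would be returned unadjusted and a vector of 100 tiny p-values would trigger the Bonferroni branch.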

S2.2.2 Choosing the background
Competitive gene set tests such as ORA compare the genes of the gene set tested against the background of genes not in the set [30]. Although rarely stated explicitly, the background is thus an important parameter [32], especially for the hypergeometric test used by ORA, where it determines the size of the population from which genes are drawn [33]. We consider three different options for the population: (i) all genes measured in the microarray or RNA-seq experiment under study, (ii) all genes annotated in the gene set collection under study, and (iii) the intersection of (i) and (ii). While the differences between these three options seem subtle, the impact on the significance estimation of the hypergeometric test can be substantial. We illustrate the impact of the choice of background with a transcriptomic study in which 12,671 genes have been tested for differential expression between two sample conditions and 529 genes were found DE. Among the DE genes, 28 are annotated to a specific gene set, which contains 170 genes in total. This setup corresponds to a 2 x 2 contingency table, where the overlap of 28 genes can be assessed based on the hypergeometric distribution. This corresponds to a one-sided version of Fisher's exact test, yielding here a highly significant enrichment. This setup would be realistic if all genes of the universe had an equal chance of being drawn. However, due to overlaps between gene sets and missing annotation for other genes, some genes are preferentially drawn, and some genes cannot be drawn at all. To account for missing annotation, we restrict the population to genes annotated in the gene set collection under study. We illustrate this using the human KEGG gene set collection, which contains roughly 8,000 genes:

> kegg.gs <- EnrichmentBrowser::getGenesets(org="hsa", db="kegg")
> length(unique(unlist(kegg.gs)))

The p-value of the hypergeometric test thereby increases by 4 orders of magnitude, i.e. the enrichment appears considerably less significant. A similar argument can be made for genes in the gene set collection that are not measured, which is more common for microarray studies than for RNA-seq studies. To account for such genes, we restrict the population to the intersection of measured genes and annotated genes. For the example considered here, we assume the intersection to contain 7,000 genes, increasing the p-value by another order of magnitude.
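The effect of the background choice can be reproduced with the following self-contained Python sketch (illustrative; the analysis itself is carried out in R, where the same quantity is phyper(27, 529, N-529, 170, lower.tail=FALSE)). For simplicity it holds the number of DE genes (529), the gene set size (170) and the overlap (28) fixed while shrinking the universe; in a real analysis, restricting the universe would typically also change the DE gene count.

```python
from math import exp, lgamma

def log_choose(n, k):
    # log of the binomial coefficient via lgamma, to avoid huge integers
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hyper_pval(overlap, n_de, n_set, n_universe):
    # upper tail P(X >= overlap) of the hypergeometric distribution:
    # n_de "successes" in a universe of n_universe genes, n_set draws
    log_denom = log_choose(n_universe, n_set)
    return sum(
        exp(log_choose(n_de, k)
            + log_choose(n_universe - n_de, n_set - k)
            - log_denom)
        for k in range(overlap, min(n_de, n_set) + 1)
    )

# 529 DE genes, gene set of 170 genes, overlap of 28, three backgrounds
p_measured = hyper_pval(28, 529, 170, 12671)   # (i) all measured genes
p_annotated = hyper_pval(28, 529, 170, 8000)   # (ii) genes annotated in KEGG
p_intersect = hyper_pval(28, 529, 170, 7000)   # (iii) intersection of (i) and (ii)
```

Shrinking the universe raises the expected overlap (170 x 529 / N), so the same overlap of 28 becomes progressively less surprising and the p-value grows from background (i) to (iii).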

S2.3 Limitations of the MalaCards relevance rankings
The MalaCards relevance rankings represent a systematic approach to quantifying the phenotype relevance of gene sets based on experimental evidence and co-citation in the literature. However, there are three main limitations that leave room for future improvements:
• The MalaCards relevance rankings are incomplete. As the rankings are constructed from the disease relevance of individual genes, only gene sets that contain at least one relevant gene are included in the rankings. However, in agreement with the considerations in Supplementary Methods S1.4 on investigation bias and observational bias, absence from a relevance ranking does not per se imply irrelevance for the phenotype. On the other hand, containing one or more relevant genes does not necessarily imply relevance of the gene set as a whole, and a gene set's relevance might further depend on the interplay between genes and the context dependency of their activation.
• The relationship between dataset, investigated phenotype, and associated relevance ranking is not always clear-cut. Supplementary Figures S17 and S18 display the relevance score distributions of the overall high-scoring methods (the six competitive methods PADOG, ORA, SAFE, GSEA, GSA, and CAMERA), identifying datasets in both benchmark compendia for which these methods consistently return low scores. For those datasets, further curation effort is required to clarify whether the inconsistency between the observed expression (in the dataset) and the expected expression (from the associated MalaCards relevance ranking) is due to (i) a poorly defined contrast between cases and controls, or (ii) a relevance ranking that is mainly based on experimental evidence types not detectable at the transcriptomic level.
• The relevance rankings have limited discriminatory power for related diseases. Supplementary Figure S14 demonstrates that the relevance rankings are partly very similar for different cancer types (especially the KEGG rankings). While this is expected to a certain extent given the universality of many cancer driver genes [34] and oncogenic pathways [35], supplementing the rankings with more fine-grained, cancer type-specific information will likely also improve the accuracy of the phenotype relevance evaluation of the GSEA methods.

3 A directional method tests whether genes in the set tend to be either predominantly up- or down-regulated; a mixed method tests whether genes in the set tend to be differentially expressed, regardless of the direction.
4 Analysis of a pre-ranked list of genes, which is useful for scenarios where the full expression matrix is not available. ORA: choose from the tools listed in Table 2 of the main manuscript. GSEA: popular implementations include GSEAPreranked [12] and fgsea [37]. CAMERA: part of the limma [8] package.

Figure S14: Similarity of MalaCards relevance rankings. The heatmap shows the percentage of the optimal phenotype relevance score on a color scale. The optimal relevance score corresponds to the case that two relevance rankings are identical. The heatmap in (a) for KEGG shows high similarity between the relevance rankings for cancer types (large red cluster in the upper right), neurodegenerative diseases (ALZ, PARK, HUNT), and previously linked autoimmune / chronic inflammatory lung diseases (LES / PDCO, [38]). These similarities are also apparent, to a lesser extent, for the GO-BP relevance rankings in (b), demonstrating higher similarity of relevance rankings within disease classes than between disease classes.
Figure S15: Phenotype relevance for the top 20% of each EA ranking. Percentage of the optimal phenotype relevance score (y-axis) when applying methods to the GEO2KEGG microarray compendium (top, 42 datasets) and the TCGA RNA-seq compendium (bottom, 15 datasets). Gene sets were defined according to KEGG (left, 323 gene sets) and GO-BP (right, 4,631 gene sets). The grey dashed line divides methods based on the type of null hypothesis tested. Computation of the phenotype relevance score is outlined in Figure 1 of the main manuscript and detailed in Methods, Section Phenotype relevance.