Gene set selection via LASSO penalized regression (SLPR)

Abstract Gene set testing is an important bioinformatics technique that addresses the challenges of power, interpretation and replication. To better support the analysis of large and highly overlapping gene set collections, researchers have recently developed a number of multiset methods that jointly evaluate all gene sets in a collection to identify a parsimonious group of functionally independent sets. Unfortunately, current multiset methods all use binary indicators for gene and gene set activity and assume that a gene is active if any containing gene set is active. This simplistic model limits performance on many types of genomic data. To address this limitation, we developed gene set Selection via LASSO Penalized Regression (SLPR), a novel mapping of multiset gene set testing to penalized multiple linear regression. The SLPR method assumes a linear relationship between continuous measures of gene activity and the activity of all gene sets in the collection. As we demonstrate via simulation studies and the analysis of TCGA data using MSigDB gene sets, the SLPR method outperforms existing multiset methods when the true biological process is well approximated by continuous activity measures and a linear association between genes and gene sets.

Existing multiset methods GenGO by Lu et al. [2], Markov chain ontology analysis (MCOA) by Frost and McCray [3], model-based gene set analysis (MGSA) by Bauer et al. [4,5] and multifunctional analysis (MFA) by Wang et al. [6] all share a similar generative model for the observed genetic data in terms of gene set activation. This generative model assumes that there are p genomic variables grouped into m overlapping gene sets as defined by an m × p gene set indicator matrix A: A m,1 · · · A m,p    , A i,j = 1(genomic variable j belongs to gene set i) The gene sets are assumed to be in either an active or an inactive state with the true activity state represented by a length m vector of indicator variables S: .., S m }, S i = 1(gene set i is active) The genomic variables are also either active or inactive with the true state represented by a length p vector of indicator variables T: T = {T i , ..., T p }, T i = 1(gene i is active) According to this model, the true gene activity state can be computed as a function of A and S. Specifically, it is assumed that a gene is active if it belongs to any active set: The observed experimental data for the p genetic variables is represented by a length p vector of indicator variables O: The generative model for the observed data based on the true states is specified by the conditional distribution of O i given T i : where α is the false positive rate and 1 − γ is the false negative rate and the O i are mutually independent conditional on all true states T. According to this generative model, the log-likelihood of the observed genomic data given the true activity states can be specified as: Note that according to Eq (4) it is possible to compute a value for the log-likelihood specified in Eq (7) given the gene set definitions in A and the true gene set activity states in S. Given this model, the goal of multiset methods GenGO, MCOA, MGSA and MFA is to estimate the true activity states of all gene setsŜ, and thus genesT, based on the observed data O and gene set definitions A. Where these four multiset methods differ is in the approach they take to generate the estimates. GenGO uses a greedy search technique to identity the set of active gene sets,Ŝ, that maximize a penalized version of the log-likelihood specified in Eq (7), where the penalty is proportional to the number of active sets. MCOA uses a modified version of the GenGO penalized MLE approach where the regularization constant is computed using the eigenvector centrality from a Markov chain model of the gene sets and genomic variables. Both MGSA and MFA take a Bayesian approach to estimate the posterior distribution of gene set activation with the maximum posterior used forŜ andT. MGSA and MFA differ in the form of the prior distribution assumed for S and the constraints placed on gene activation given gene set activation (MFA uses a prior consistent with a more restrictive activation hypothesis that requires a gene set to be active if all member genes are active).

Mathematical details of SLPR method
The SLPR method supports the analysis of experiments in which p genomic variables, e.g., expression levels for mRNA molecules, along with a set of other covariates of interest are measured under n independent experimental conditions. It is assumed that prior knowledge allows the genomic variables to be grouped into a collection of m overlapping sets, where each set is associated with a specific biological function, e.g., Gene Ontology (GO) terms. For such experiments, research interest typically focuses on the statistical association between one of the covariates (e.g., case/control status) and each of the m gene sets.

Required data
Similar to the multiset methods GenGO, MCOA, MGSA and MFA detailed above, the SLPR method requires two inputs: 1) a summary statistic for each of the p genomic variables that quantifies the importance of the variable in a specific experiment, and 2) membership information for a collection of gene sets defined over the p genomic variables.
It is assumed that the gene set collection is specified using an indicator matrix A as defined in Eq (1). Depending on the gene set collection that is used to populate A, it is possible that the analyzed collection will miss gene sets that are truly active and are correlated with gene sets in A. In this situation, the results from SLPR for a specific gene set can vary significantly depending on whether or not the other active and correlated set is included in the collection. This case is equivalent to the problem of omitted variable bias in regression. Although an important limitation, it does not negatively impacts the performance of the SLPR method relative to other multiset methods since they are all impacted by this challenge. Although self-contained uniset methods avoid this specific issue (i.e., the results for a given set do not depend on other tested sets), competitive uniset methods may be impacted if the tested competitive H 0 involves a comparison of the statistic for each gene set against the statistics for all other sets in the collection. The gene-level summary statistics are assumed to be contained in a length p vector Z: Although the vector Z serves a similar purpose as the vector O defined in Eq (3), it is being represented using a separate variable to clarify the binary vs. continuous distinction. Similar to the assumption of independence made regarding the elements of O, it is assumed that the elements of Z are independently measured. While any desired gene-level summary statistic can be used with the SLPR method, it is common with gene set testing methods to use a gene-level statistic that captures the association between each genomic variable and one the covariates, e.g., the estimated coefficient in a linear regression model of the genomic variable on the covariate or the t-statistic associated with that coefficient estimate. If X i is a random variable representing the i th genomic variable and Y is a random variable representing the covariate of interest, then such a gene-level test statistic would be the estimatedβ 1 in the model E[X i |Y ] = β 0 + β 1 Y or a standardized version of the estimate, i.e.,β 1 /s.e.(β 1 ). For the SLPR method, it is preferable to set Z to effect size estimates that have a clear biological interpretation, i.e.,β 1 , vs. measures based on just the statistical significance of the association, i.e.,β 1 /s.e.(β 1 ). For the real data example, Z is set to an effect size estimate that takes into account the confidence interval forβ 1 . An optional transformation can also be applied to each of the Z i statistics. One popular transformation is the absolute value transformation, i.e., Z i = |Z i |. Such a transformation provides increased power to detect scale alternatives, i.e. gene sets that containing both significantly enriched and repressed genomic variables, whereas the use of untransformed gene-level test statistics has superior power against shift in location alternatives, i.e., gene sets containing just genes with a common direction of association [7].

SLPR model
The SLPR method assumes a biological model under which the measured value of each genomic variable reflects the concurrent activity of multiple biological processes or pathways, where each potentially active process or pathway is defined by one of the m gene sets defined in the indicator matrix A defined in Eq (1). This model further assumes that the gene-level summary statistics Z i , i = 1, ..., p, can be modeled by a linear function of statistics associated with all of the gene sets to which each variable belongs, where the set-level statistics quantify the activity-level of the entire process or pathway during the experiment. This biological model can be represented statistically by the following multiple linear regression model: where • Z is the vector of gene-level summary statistics defined in Eq (8).
• β 0 is a regression coefficient that captures the average value of gene-level summary statistics when the genomic variables are not associated with any active gene sets.
• A is the gene set annotation matrix defined in Eq (1).
• β is an m-dimensional vector of gene set-level statistics that quantify the activity of each set during the experiment. This serves a similar purpose as the vector S defined in Eq (2).
Although statistically straightforward, this model is a novel and effective mapping of the multiset gene set testing problem to multiple linear regression. An important implication of this model is the assumption that the activity of multiple gene sets has an additive impact on the gene-level statistics Z. If, for example, Z represents the relative abundance of gene products (e.g., mRNA molecules) under different environmental conditions, the activity of two gene sets with independent functions and activity levels of similar magnitude and direction that both contain the same gene i would be expected to produce a statistic Z i roughly twice as large as the statistic value generated when only one of the two gene sets is active. Other, more complex, models of gene activity can be supported with the SLPR method by setting the elements of the A matrix to desired non-1 values (e.g., replace the elements in each row of A by a value that reflects the relative contribution of the associated gene to the activity of the gene set). See Section 1.2.4 below for more details on possible modifications to the A matrix.

SLPR estimation
Estimation of gene set activity given data for p genomic variables under n experimental conditions and gene set definitions in matrix A defined in Eq (1) is performed using the following sequence of steps: The vector Z of statistics defined in Eq (8) captures the importance of the p genomic variables during the experiment and can be computed using the observed experimental data, e.g., via a regression model, or obtained from a source of prior biological knowledge. Any desired continuous gene-level summary statistics can be employed with the SLPR method. Of course, this includes categorical values as a special case so SLPR can be executed on the same binary vector O defined in Eq (3) used by the existing multiset methods GenGO, MCOA, MGSA and MFA.
2. Solve a LASSO penalized version of the model in Eq (9).
The model in Eq (9) is fit using a LASSO [8] penalty that performs both variable selection and coefficient shrinkage. Specifically, the following objective function is maximized: The penalty parameter λ can be selected according to cross-validation or to achieve a specific number of non-zero coefficients. Note that the intercept term, β 0 , is not penalized. For the results presented in main manuscript, the model defined by Eq (10) was fit using the glmnet R package [9] with λ set via 10-fold cross-validation to the value of λ corresponding to the minimum mean cross-validation error (i.e., minimum mean squared error for this case). To reduce the variance associated with the random splitting of the data, the cross-validation process can be repeated multiple times with λ set to the mean value.
Although we believe that LASSO penalization is the optimal estimation approach for the SLPR regression model, it is important to note that SLPR does not require LASSO penalization and could be realized (albeit with degraded computational performance and model selection characteristics) using other regression methods that support variable selection and the p > n use case, e.g., smoothly clipped absolute deviation (SCAD) [10] or even standard forward stepwise regression. (9) retaining only those predictors that have non-zero coefficients at the optimal LASSO penalty level.

Solve an unpenalized version of the model in Eq
The LASSO-penalized regression is followed by an unpenalized regression using just the predictors with non-zero coefficient estimates in the LASSO fit. This two-stage, so-called Gauss-Lasso [11] approach retains the model selection benefits of the LASSO while also generating non-shrunken coefficient estimates and approximate measures of statistical significance. As discussed in Javanmard et al. [11], the Gauss-Lasso technique in fact supports model selection consistency under the much more broadly applicable generalized irrepressibility condition. The LASSO is only model selection consistent under the irrepressibility condition, which can be interpreted as requiring orthogonality between the predictors with true zero coefficients and the predictors with true non-zero coefficients [12,13].
4. Use the estimates ofβ from the penalized or unpenalized regressions to identify a ranked list of biologically relevant gene sets.
The members of this ranked list are all gene sets k, k = 1, ..., m, associated with a non-zero element inβ, i.e.,β k = 0 from the LASSO-penalized regression. Ranking is determined by the absolute value of the the estimatedβ i coefficients from either the penalized or unpenalized regression models. If the gene-level statistics Z capture the direction of association, then the sign of the estimatedβ i statistics can be used to determine an enrichment direction for each gene set. Ranking could alternatively be performed using the approximate statistical significance of the coefficient estimates in the unpenalized model.

SLPR extensions
SLPR can be easily extended to support more complex analyses that involve: • Covariate adjustment To control for covariates using the SLPR method, the gene-level summary statistics Z defined in Eq (8) can be computed using a regression model that supports covariate adjustment, e.g., multiple regression using a generalized linear model [14]. This assumes that instance-level data is available.
• Weights for gene sets If specific gene sets are known a priori to be more important than others and this can be quantified using continuously valued weights sw i , i = 1, ..., m, this prior knowledge can be incorporated into the SLPR method by applying the gene set weights sw i as predictor penalty factors in the elastic net objective function as follows: • Weights for genes If specific genomic variables are known a priori to be more important than others and this can be quantified using weights gw i , i = 1, ..., p, this knowledge can be integrated into the SLPR method through adjustment of the gene set indicator matrix A defined in Eq (1). If the weights gw i apply uniformly across all m gene sets, then an adjusted A can be computed as A = A diag(gw 1 , ..., gw p ), where diag(gw 1 , ..., gw p ) represents a p × p matrix with the weights gw i on the diagonal. If, on the other hand, the genomic variables have gene set-specific weights, gw j i for genomic variable i relative to gene set j, then the adjusted matrix is computed as A = A * [gw 1 , ..., gw m ] T , where * represents element-wise multiplication and [gw 1 , ..., gw m ] is a p × m matrix whose columns contain the genomic variable weights specific to each gene set.

• Gene set testing for single samples
To support single sample analysis, the outcome vector Z can be populated with measurements for the p genes computed on one sample rather than with gene-level test statistics.

Model assessment
To help researchers decide whether a given data set is better fit by the SLPR model or by a model similar to that used by existing multiset methods like MGSA, we have devised an approximate model assessment test based on a comparison of Akaike information criterion (AIC) values for two non-nested linear regression models: • Unpenalized SLPR model from the second stage of the Gauss-Lasso estimation.
• A linear model that approximates the non-additive MGSA model by creating a single binary predictor whose value is set to 1 if a given gene is a member of any gene sets that have non-zero coefficients in the penalized SLPR model. The standard gene-level test statistics are then regressed on this binary predictor.
• Model assessment is based on a comparison of these two AIC values.

SLPR implementation
The SLPR method was implemented in the R statistical programming language [15] with the R glmnet package [9] employed for estimation of the LASSO penalized model specified in Eq (10). Since the glmnet package handles the bulk of the implementation complexity, the SLPR R code is succinct and a version is included in Section 3 of this SI. The SLPR R code, logic used to generate the simulation and real data results can also be downloaded from http://www.dartmouth.edu/˜hrfrost/SLPR.

Benchmark methods
For comparative evaluation of the SLPR method, the MGSA multiset method [4] and a simple competitive uniset method were used as benchmarks. For MGSA, the R implementation in the mgsa R package was employed and, for the uniset method, the geneSetTest function in the limma Bioconductor R package [16] was used. The geneSetTest method was executed with the following parameter settings: alternative="either", type="t", ranks.only=T. The mgsa method was executed with the observed state for each gene based on a dichotomized gene-level test statistic (see sections below for dichotomization details for the simulation and real data examples), the standard grid values for the p, α, and β parameters, restarts=5 and steps=1e6.
For gene set ranking, the posterior probability was used for MGSA and the -log(p-value) was used for geneSetTest.

Simulation study design
To assess the relative statistical performance of the SLPR method, 10 primary simulation studies were performed using a variety of activation models and two real gene set collections, as summarized in Table  S1 below. Additional simulation studies were also conducted to explore the impact of log-transformation of Z and dependence between the elements of Z. Details for the log-transformation and inter-gene correlation simulations is contained in Sections 2.1 and 2.2 below. For these simulations, two small-to moderate sized MSigDB [1] gene set collections were used to define the gene set indicator matrix A defined in Eq (1). Specifically, we used v5.0 of the MSigDB C2.CP.REACTOME collection (674 gene sets) and C5.CC collection (233 gene sets). These MSigDB collections contain gene sets from two well known and widely used repositories of curated gene sets: the Reactome pathway database [17] and the cellular component branch of the Gene Ontology pathway database [18] . For each simulation, a random proportion of the gene sets in the target MSigDB collection were deemed to be "active" and then gene-level summary statistics Z defined in Eq (8) were generated according to either a non-additive model (corresponding to the generative model in Eq (6)) or an additive model. In this context, additive implies that the summary statistic for a given gene is an additive function of the active gene sets in which the gene is a member. Likewise, non-additive implies that the summary statistic for a given gene is the same irrespective of the number of associated active gene sets. To support the MGSA method, the Z statistics were discretized using the thresholds specified in Table S1. The additive model took the form: with Z, A, and S as defined in Eqs (8), (1) and (2), µ was a fixed activation effect size, and ε is a vector of independent N (0, 1) random variables with the same length as Z. The non-additive model took the form: with T as defined in Eq (3) and Z, µ and ε as defined for the additive model above. As seen in Table S1, different parameters were varied in each simulation use case to explore the sensitivity of method performance to important model features. To highlight the non-random differences between the additive and nonadditive models, data was simulated for the two pairs of additive and non-additive models (models 1 and 2 and models 3 and 4) using the same randomly generated lists of true gene sets and the same simulated ε noise vectors. For each model, 250 data sets were simulated and tested using the SLPR, MGSA and geneSetTest methods. Method performance was evaluated in terms of how well each method could identify truly active gene sets as quantified by the area under the receiver operating characteristic curve (AUROC) computed on a ranked list of all gene sets in the tested collection. The R package ROCR [19] was used to plot the ROC curves. Due to the large size of typical gene set collections and standard focus during analysis on just the top portion of the ranked list of gene sets, we also computed partial area under the receiver operating characteristic curve (pAUROC) [20] using a false positive rate (FPR) upper limit of twice the proportion of true positives in the simulated data.

Real data analysis design
To evaluate the efficacy of the SLPR method on real genomic data, we performed gene set testing of lung adenocarcinoma and lung squamous cell carcinoma data from The Cancer Genome Atlas (TCGA) [ Low activity 5% 6 High activity 20% 7 Small µ 0.5 8 Large µ 1.0 9 Small Z thresh. Φ −1 (0.85) 10 Large Z thresh. Φ −1 (0.95) Table S1: MSigDB-based simulation models All models below the model 1 (the default) only show the differences from the model 1 settings. The columns have the following interpretation, #: number of the simulation model, name: descriptive name of the simulation model, MSigDB collection: name of the MSigDB gene set collection (v5.0) used to populate the A matrix, Z model: the statistical model used to simulate the gene-level summary statistics (either Eq (12) or Eq (13)), % active: proportion of gene sets that are truly active, µ: the mean of the summary gene-level statistic for active genes, and Z thresh.: the threshold used to discretize the Z i statistics for the MGSA method.
• TCGA mutation data -Source file: TCGA PANCAN12 mutation-2015-01-28.tgz -Data type details (from UCSC Cancer Browser documentation): "TCGA PANCAN12 (PANCAN12) somatic mutation data. Red (=1) indicates that a non-silent somatic mutation (nonsense, missense, frame-shif indels, splice site mutations, stop codon readthroughs) was identified in the protein coding region of a gene, or any mutation identified in a non-coding gene. White (=0) indicates that none of the above mutation calls were made in this gene for the specific sample. Somatic mutations calls (even on the same tumor DNA extract) are affected by many factors including library prep, sequencing process, read mapping method, reference genome used, genome annotation, calling algorithms, and ad-hoc pre/postprocessing such as black list genes, target selection regions, and black list samples. This dataset is the best effort made by the TCGA PANCANCER Analysis Working Group." Among the 437 lung adenocarcinoma subjects in the PANCAN12 data, just 325 had RNA-seq data and only 227 had both RNA-seq and gene-level non-silent mutation data. Among the 360 lung squamous cell carcinoma subjects in the PANCAN12 data, just 258 had RNA-seq data and only 178 had both RNA-seq and gene-level non-silent mutation data.
To enable gene set testing, the gene-level test statistic vector Z defined in Eq (8) was populated using the smallest estimated effect size for each gene that was within the 95% confidence interval (CI). Specifically, the value of z i for gene i was computed using the following procedure: 1. Perform a two-sample, two-sided t-test comparing the values measured for the gene in lung adenocarcinoma samples to the values measured in lung squamous cell carcinoma samples.
2. If the 95% confidence interval (CI) of the estimated mean difference included 0, z i was set to 0.
3. If the 95% CI of the estimated mean difference did not include 0, z i was set to the 95% CI value with the smaller absolute value, e.g., if both the upper and lower 95% CI values are negative, z i would be set to the upper CI value.

4.
A binary indicator variable for use with the MGSA method was computed as 1(z i = 0).
The gene-level statistics computed according to this procedure were then used to perform gene set testing relative to the MSigDB C2.CP collection using the SLPR, MGSA and geneSetTest methods. This approach ensured that the Z values were on the effect size scale while also taking into consideration the variance of the effect size estimates. It is important to note that this approach is distinct from the use of t-statistics or z-scores for Z, where the magnitude of the statistic directly corresponds to the p-value and can therefore be quite large even for tiny effect sizes if the gene measurements have low variance. Following this procedure, the Z values had the following interpretation for the expression and mutation data: • TCGA gene expression data Because the TCGA expression data is log2-transformed, the computed Z values represent log2-foldchanges for the expression of the target gene between lung adenocarcinoma and lung squamous cell carcinoma subjects. This has the implication that the SLPR regression model is linear on the log2fold-change scale or multiplicative on the fold-change scale. The estimated coefficients for each gene set in the SLPR model therefore correspond to a % change in the expected fold-change values when that gene set is active. One benefit of the SLPR method is that gene set selection performance is robust to log-transformation of outcome. So, even if an additive model on the fold-change scale is a better fit to the data than a model that is additive on the log-fold-change scale, the performance of the SLPR method will be similar. See Section 2.1 of the SI for simulation results that illustrate the good predictive performance of SLPR when the true model is linear but log-linear data is fit.
• TCGA mutation data The computed Z values in this case represent the difference between the proportion of lung adenocarcinoma subjects that have a non-silent mutation in the target gene and the proportion of lung squamous call carcinoma subjects that have a non-silent mutation in the target gene.
Some important limitations of this analysis include the subjective determination of biological relevance as well as the fact that only p-value rankings were taken into account for geneSetTest when typical review of uniset methods would examine all gene sets with significant adjusted p-values.

Real data concordance analysis
To assess how well each method was able to replicate the top-ranked gene sets, Kendall's coefficient of concordance [23] (as implemented in the R package irr ) was computed for each method and data type across the top 25 gene sets computed for five random and disjoint subsets of the lung adenocarcinoma using all of the lung squamous cell carcinoma samples. The partitioning was applied to just the adenocarcinoma samples given the smaller number of squamous cell carcinoma samples. Specifically, concordance was calculated using the ranks of the top 25 gene sets within the gene set results computed for all lung adenocarcinoma and squamous cell carcinoma subjects. For SLPR, if the number of gene sets with non-zero coefficients in the analysis for all lung adenocarcinoma subjects was less than 25, this smaller number was used for concordance computation.

Log-tansformed outcome
To assess the sensitivity of the SLPR model to a log-transformation of the outcome, data was simulated according to model 1 from Table S1 and a log-transformation was applied to the generated gene-level test statistics prior to execution of the evaluated multiset methods. This scenario thus represents a case where the true model is linear (i.e., the gene-level test statistics have a linear relationship with gene set activity) but a log-linear model is actually fit. As comparison of Figure 1 from the main manuscript with Figure S1 demonstrates that the performance of the SLPR method is robust to a log-transformation of the outcome.
The MGSA method, on the other hand, demonstrates severely degraded performance when applied to the log-linear data. Figure S1: Mean ROC curves for MSigDB-based simulation model 1 from Table S1 with log-transformed Z. Error bars on the ROC curves represent ±1 SE.

Inter-gene correlation
As currently formulated, the SLPR method assumes that the gene-level test statistics Z are independent. This assumption had several motivations: 1.
The assumption is consistent with the approach taken by existing multiset methods.
All current multiset methods (MGSA, GenGO, MCOA and MFA) also assume independent gene-level statistics.
2. If only summary statistics are available for each gene, an empirical estimate of the inter-gene covariance matrix will not be available.
Adjustment to account for the correlation through an approach such as generalized least squares is therefore not feasible when only gene-level test statistics are available.
3. The selection-based performance of SLPR should be insensitive to correlations among the gene-level test statistics when active and inactive gene sets have similar levels of inter-gene correlation.
The general insensitivity of gene set testing results in this scenario is a common feature of all competitive gene set testing methods. The performance of SLPR can also be expected to be robust in this case for the following reasons: • Although inter-gene correlation will impact the estimation of gene set statistics, both active and inactive gene sets will be impacted in a similar fashion so ranking should, in general, not be significantly perturbed.
• Since the gene-level tests statistics are the dependent variable in the SLPR regression model (9), correlations between these statistics is equivalent to the problem of correlated observations. As is well known in the regression literature, correlation between the observations impacts the estimated variance of the regression coefficients in linear models; the coefficient estimates themselves should remain unbiased even when correlated observations are assumed to be independent [24].
It is important to note that the performance of SLPR may be significantly impacted under certain scenarios, e.g., when the inter-gene correlation for non-active sets is larger than the inter-gene correlation for active sets.  Table S2: pAUROC results for MSigDB-based simulation models The partial area under the receiver operator characteristic curve (pAUROC) was computed using a false positive rate (FPR) upper limit of twice the proportion of true positives. This specific FPR limit for each model is specified in the FPR limit column and the ratio of the mean pAUROC to FPR limit is shown in the columns to the right. SLPR results are shown based on coefficients from both the penalized regression (LASSO) and from the unpenalized regression (OLS).

Results for simulation models 3 thru 10
This section contains figures illustrating the comparative performance of the SLPR, MGSA and geneSetTest methods on simulation models 3 through 10 as detailed in the "Simulation Design" Section of the main manuscript. Figure S2: Mean ROC curves for MSigDB-based simulation model 3 from Table 1 Table S3 contains the results from the model assessment test described in Section 1.2.5 on the various simulation scenarios. As seen in the table, the AIC value for the SLPR model is substantially smaller than the AIC value for the binary predictor model for all simulations scenarios (1, 3 and 5-10) where the gene-level test statistics were generated according to the SLPR model. The difference in AIC values is significantly smaller for the two simulation scenarios (2 and 4) where the gene-level test statistics were generated according the MGSA model. The AIC difference is also roughly proportional to the difference between AUROC values for SLPR and MGSA . Although this is only an approximate test, the results generally align with the true model demonstrating that this heuristic can provide useful information to researchers regarding the most appropriate multiset gene set method for a given data set.  4. PID TCPTP PATHWAY (Signaling events mediated by TCPTP): Associated with negative regulation of EGFR [42]. Expression and mutation of EGFR-related genes is known to differ between lung adenocarcinoma and squamous cell carcinoma [30,43].

Simulation model assessment results
5. BIOCARTA TEL PATHWAY (Telomeres, Telomerase, Cellular Aging, and Immortality): Telomerase activity is known to differ between small cell and non-small cell lung cancers [44] and is predictive of patient survival in NSCLC patients [45].   Table 2 in the main manuscript, 10 of the first 11 results directly match the 10 C2.CP gene sets with the largest overlap with the top gene set REACTOME CELL CYCLE MITOTIC, i.e., geneSetTest is consistently selecting gene sets related to the cell cycle.

TCGA concordance results
Based on our hypothesis regarding on the association between the two gene activation models and the gene expression and mutation data, we expected the SLPR method to have superior concordance relative to MGSA on the gene expression data and MGSA to have superior concordance relative to SLPR on the mutation data. The computed concordance results shown in Table S4 are consistent with this hypothesis. Note that the SLPR concordance for the mutation data was computed using just the top three gene sets since only three gene sets had non-zero coefficients in the analysis using all lung adenocarcinoma subjects. As a uniset method, we expected the top-ranked gene sets output by the geneSetTest method to be highly overlapping and to therefore capture only a fraction of the distinct and biologically plausible gene sets selected by either MGSA or SLPR. Due to the significant expected overlap among the top-ranked results from geneSetTest, we also expected misleadingly large concordance values for geneSetTest. The geneSetTest results shown in Tables S4 for the gene expression data are consistent with this hypothesis. For the top geneSetTest results shown in Table 2 of the main manuscript, 10 of the first 11 results directly match the 10 C2.CP gene sets with the largest overlap with the top gene set REACTOME CELL CYCLE MITOTIC (see Figure S10 for detailed overlap results). These highly redundant results explain the very high concordance value of 0.94, i.e., geneSetTest is consistently selecting gene sets related to the cell cycle. For the mutation data, on the other hand, geneSetTest had the lowest concordance value but this result is consistent with the much lower level of overlap among the gene sets and general lack of biological plausibility.
2.10 TCGA model assessment results Table S5 contains the results from the model assessment test described in Section 1.2.5 on the TCGA gene expression and mutation data. As seen in the table, the AIC values for the SLPR and binary predictor models are quite close for the mutation data, suggesting the MGSA model is most appropriate in this case. For the gene expression data, on the other hand, the SLPR AIC value is markedly lower than the AIC value for the binary predictor model, suggesting that SLPR provides a better fit to the data. While these relative AIC values provide only a very rough heuristic, they match our expectations regarding model fit for these data sets.  To illustrate the results on the TCGA example from a more sophisticated uniset method, we used the R implementation of the CAMERA method [49] from the limma package to analyze the same MSigDB collection (C2.CP) for TCGA lung adenocarcinoma vs. lung squamous cell carcinoma RNA-seq data. For this analysis, missing TCGA data elements were imputed using unconditional mean imputation and CAMERA was executed using default settings. The top ten C2.CP pathways generated by CAMERA are listed in the table S6. As seen in this table, the top results from CAMERA are distinct from those generated by geneSetTest, which is not surprising given the inter-gene correlation adjustment performed by CAMERA and the fact that the gene-level test statistics employed in the two cases are not identical. If CAMERA is executed assuming 0 inter-gene correlation, it does identify several cell cycle related sets among the top results and so overlaps with geneSetTest. In terms of the performance of CAMREA on this particular analysis, CAMERA fails to generate any significant findings (smallest FDR q-value is 0.45) and, for two of the gene sets of biological interest found by SLPR and discussed in the main manuscript (i.e., PID DELTA NP63 PATHWAY and PID TAP63 PATHWAY), CAMERA generated unadjusted p-values of 0.305 and 0.472 respectively.   }) + + #message("Total number of vars: ", length(col.sum)) + #message("Number of vars belonging to more than one non-zero set: ", length(which(col.sum > 0))) + + # Fit a model to the single binary predictor + binary.fit = lm("y~x",data.frame(y=design.mat[,1], x=binary.predictor)) + + results$binary.aic = AIC(binary.fit) + message("SLPR OLS AIC: ", results$slpr.aic, ", binary predictor AIC: ", results$binary.aic) + } + } else { + warning("Gauss-Lasso requested but no non-zero predictors!") +