signatureSearch: environment for gene expression signature searching and functional interpretation

Abstract signatureSearch is an R/Bioconductor package that integrates a suite of existing and novel algorithms into an analysis environment for gene expression signature (GES) searching combined with functional enrichment analysis (FEA) and visualization methods to facilitate the interpretation of the search results. In a typical GES search (GESS), a query GES is searched against a database of GESs obtained from large numbers of measurements, such as different genetic backgrounds, disease states and drug perturbations. Database matches sharing correlated signatures with the query indicate related cellular responses frequently governed by connected mechanisms, such as drugs mimicking the expression responses of a disease. To identify which processes are predominantly modulated in the GESS results, we developed specialized FEA methods combined with drug-target network visualization tools. The provided analysis tools are useful for studying the effects of genetic, chemical and environmental perturbations on biological systems, as well as searching single cell GES databases to identify novel network connections or cell types. The signatureSearch software is unique in that it provides access to an integrated environment for GESS/FEA routines that includes several novel search and enrichment methods, efficient data structures, and access to pre-built GES databases, and allowing users to work with custom databases.

LINCS GEP data. The Broad Institute has generated the LINCS GEP data with the bead-based L1000 assay for gene expression profiling. Since this technology is not widely used yet and pre-processing methodologies for its data are limited in the public domain, we have chosen to use the pre-generated data instances from the LINCS project directly rather than attempting to regenerate them from raw data. The GEP data from LINCS data can be downloaded from GEO in 5 different pre-processing levels (2). Level 1 data are the raw mean fluorescent intensity values that come directly from the Luminex scanner. Level 2 data are the expression intensities of the 978 landmark genes. They have been normalized and used to impute the expression of an additional set of 11,350 genes, forming Level 3 data. A robust Z-scoring procedure was used to generate differential expression values from the normalized profiles (Level 4). Finally, a moderated Z-scoring procedure was applied to the replicated samples of each experiment (mostly 3 replicates) to compute a weighted average signature (Level 5). For a more detailed description of LINCS' preprocessing methods, readers want to refer to the methods section in the corresponding publication by Subramanian et al., 2017 (2).
The differential expression data from LINCS used in this article are level 5 Z-scores. Since some GESS methods such as gCMAP and Fisher require gene sets in the reference database, Z-score cutoffs can be used to filter for sets of up-and down-regulated differentially expressed genes (DEGs). In this article, the corresponding up or down DEG sets were obtained with Z-score cutoffs of ≥ 1 or ≤ −1, respectively. In signatureSearch, these Z-score cutoffs can be assigned to filtering arguments to generate either query or database instances meeting the corresponding Z-score constraints. Examples of GS-DBs where this is relevant are those used by the gCMAP and Fisher GESS methods. In addition to using Z-score cutoffs, GS-Qs can also be extracted by specifying a fixed number of the most extremely upand down-regulated genes, such as the top 150 up-and down-regulated DEGs, respectively. Whether GS-Qs or GS-DBs instances were obtained by Z-score or number cutoffs is specified in the corresponding sections of the article. If the cutoff parameters deviate from the above default values then they are given as well. Examples of GESS function calls related to these routines are provided in the vignettes of the software and data packages of the signatureSearch environment. For instance, the subsection 'DEG and Cutoff Definitions' in the signatureSearchData vignette provides details on this topic.
CMAP2 GEP data This section provides a short overview of the CMAP2 data pre-processing steps to illustrate how this drug-perturbation GEP-DB could be used instead of LINCS for the performance test and proof-of-concept experiments included in this article. Both databases are supported by signatureSearchData, but for consistency we only used the LINCS database in the experimental sections. Since the Affymetrix GeneChip R technology used by CMAP2 is supported by a rich ecosystem of widely used analysis software, we generated the pre-processed and final data tables for this data set from the corresponding raw files (here CEL files), and deposited the results on Bioconductor's ExperimentHub for easy access with signatureSearchData. To compare the search results generated with the CMAP2 online service and the GESS methods from signatureSearch, we also included the CMAP2 rank matrix that is based on rank transformed differential expression values for all assayed genes. The latter can be downloaded from the CMAP2 project site. For the raw data processing from CEL files, normalized gene expression data were generated with the MAS5 algorithm (3). Next, the DEG analysis was performed with the limma package (4) using the experimental design table included in the CMAP2 data set to define replicates, as well as control and treatment samples. The statistical result tables generated by limma, including LFC values, pvalues and false discovery rates (FDR), were saved to the HDF5 files we are hosting on Bioconductor's ExperimentHub. These statistical values can be used by the query retrieval and GESS methods in signatureSearch to define DEGs with single or combinatorial cutoff parameters, such as DEGs that have an LFC value of ≥ 1 or ≤ −1, and an FDR of ≤ 0.01. Although the LINCS and CMAP2 result tables had to be generated with different statistical methods, one can filter in both cases for DEGs with cutoffs that can be applied to statistical values with comparable meaning (e.g. LFCs can be used instead of Z-scores). Detailed instructions along with the corresponding R code for creating the corresponding gene expression and statistical result tables are provided in the CMAP2 pre-processing sections of the signatureSearchData vignette. For instance, instructions for defining DEG sets with combinatorial filters of statistical parameters are given in the Supplement section of the vignette under 'DEG and Cutoff Definitions'.

S2
Additional details about FEA algorithms Duplication Adjusted Hypergeometric Test (dup_hyperG). The classical hypergeometric test assumes uniqueness in its gene/protein test sets. Its p-value is calculated according to (1) In case of GO term enrichment analysis the individual variables in equation (1) are assigned the following values. N is the total number of genes/proteins contained in the entire annotation universe, D is the number of genes annotated at a specific GO node, n is the total number of genes in the test set, and x is the number of genes in the test set annotated at a specific GO node. To maintain the duplication information in the test set used for TSEA, the values of n and x in the above equation are the corresponding gene counts including duplications.
Modified Gene Set Enrichment Analysis (mGSEA). The original GSEA method (5) uses predefined gene sets Ss defined by a chosen functional annotation system, such as GO or KEGG categories. The goal is to determine whether the genes in S are randomly distributed throughout a ranked test gene list L (e.g. all genes ranked by LFC), or enriched at the top or bottom of L. This is expressed by an Enrichment Score (ES) reflecting the degree to which a set S is overrepresented at the extremes of L. For TSEA, the test set L is a target set T associated with the top ranking drugs in a GESS result obtained from a drug-based GES database. Frequently, the corresponding gene identifiers in T are not unique, because several drugs in a GESS result may share the same targets. To account for the characteristic nature of GESS results, it is of utmost importance to maintain this duplication information as much as possible. To perform GSEA with duplication support, here referred to as mGSEA, the target set T is transformed to a score ranked target list L tar of all targets included in the corresponding annotation system. For each target in T , its frequency is divided by the number of all targets in T (including duplications), which is the weight of that target. For targets present in the annotation system but absent in the target set T , their scores are set to 0. Thus, every target in the annotation system will be assigned a score. Subsequently, the target list will be sorted decreasingly to obtain L tar . Importantly, the original GSEA method cannot be used for TSEA directly since zeros are very frequent in L tar . As a result, the sum N R can become zero too which cannot be used as the denominator in equation (2) from Subramanian et al. (2005). To avoid this problem, the affected ES values are ignored by assigning -1 as a tag.
If only some genes in set S have scores of zeros then the value of N R is increased according to equation (3). The latter adds to N R the minimum value of the non-zero gene scores in S multiplied by the number of genes in S that have scores of zero. Increasing N R can in return decrease the weight of the genes in S that have non-zero scores. To compensate for this, the mGSEA algorithm computes N R according to equation (3) instead of equation (2). P hit (S,i) in equation (2) evaluates the fraction of genes in S ("hits") weighted by their scores present up to a given position i in L tar , where r j is the score of gene j in L tar . Typically, the exponent p is set to 1 in order to weight the genes in S by their scores in L tar .
The motivation for the above modifications is that if only a small number of genes in set S has non-zero scores and these genes rank high in L tar , the weight of these genes will be close to 1 resulting in an ES(S) of close to 1. Thus, the original GSEA method would score the gene set S of a functional category as significantly enriched. However, this is undesirable because in this example only a small number of genes is shared among the test target set T and the gene set S of a functional category. To avoid this, small weights are assigned to genes in S that have scores of zero. The latter decreases the weight of the genes in S that have scores other than zero, thereby decreasing the false positive rate. Finally, the functional categories (gene sets Ss) are ranked by ES from highest to lowest, where the top ranking ones are favored as enriched GO terms and KEGG pathways.
MeanAbs (mabs). The input for the MeanAbs method is L tar , the same as for mGSEA. In this enrichment statistic, mabs(S), of a gene set S is calculated as mean absolute scores of the genes in S (6). In order to adjust for size variations in gene set S, random permutations (e.g. π = 1000) of L tar are performed to determine mabs(S,π). Next, mabs(S) is normalized by subtracting the median of the mabs(S,π) and then dividing by the standard deviation of mabs(S,π) yielding the normalized scores N mabs(S). Subsequently, the portion of mabs(S,π) that is greater than mabs(S) is used as nominal p-value. Finally, the resulting nominal p-values are adjusted for multiple hypothesis testing using the Benjamini-Hochberg method (7).

Filtering of MOA and SSC Categories
The 276 MOA categories were downloaded from the Touchstone database. They were associated with at total of 1,555 compounds. Since not all MOA categories are expected to contain drugs that induce similar gene expression changes, MOA categories predominantly associated with dissimilar GESs were eliminated by a filtering process based on recall rates that were averaged across all six GESS methods. For this, the GESs associated with drugs belonging to a MOA category were searched iteratively against the LINCS database. For each query result, the rankings of the GESs belonging to the same MOA category as the query were recorded. The joined ranking results for all queries of a MOA were then summarized using the mean of the ranks, and the mean rank percentile was set as the recall rate of a MOA for the corresponding GESS method. To make sure none of the six GESS methods had been given an unfair advantage in this selection process, the MOA level recall rates were combined by calculating the mean of the recall rates across all six GESS methods. The latter was used for the final ranking of the MOA categories. Subsequently, the top 25% ranking MOA categories were used for the GESS performance tests described in the main text of this article. The final set included a total of 69 MOA categories associated with 309 compounds. The filtering of the SSC categories was performed the same way as the filtering of the MOA categories.   Table S2. GESS methods applied to MOA categories. To assess whether the observed performance differences are statistically significant for all pair-wise comparisons, the bootstrap method was used for both the global AUC and pAUC metrics. The BH method was used for multiple testing correction (7). The columns contain: GESS method 1 a ; GESS method 2 b ; P-value c and adjusted P-value d .