Application of Transcriptional Gene Modules to Analysis of Caenorhabditis elegans’ Gene Expression Data

Identification of co-expressed sets of genes (gene modules) is used widely for grouping functionally related genes during transcriptomic data analysis. An organism-wide atlas of high-quality gene modules would provide a powerful tool for unbiased detection of biological signals from gene expression data. Here, using a method based on independent component analysis we call DEXICA, we have defined and optimized 209 modules that broadly represent transcriptional wiring of the key experimental organism C. elegans. These modules represent responses to changes in the environment (e.g., starvation, exposure to xenobiotics), genes regulated by transcriptions factors (e.g., ATFS-1, DAF-16), genes specific to tissues (e.g., neurons, muscle), genes that change during development, and other complex transcriptional responses to genetic, environmental and temporal perturbations. Interrogation of these modules reveals processes that are activated in long-lived mutants in cases where traditional analyses of differentially expressed genes fail to do so. Additionally, we show that modules can inform the strength of the association between a gene and an annotation (e.g., GO term). Analysis of “module-weighted annotations” improves on several aspects of traditional annotation-enrichment tests and can aid in functional interpretation of poorly annotated genes. We provide an online interactive resource with tutorials at http://genemodules.org/, in which users can find detailed information on each module, check genes for module-weighted annotations, and use both of these to analyze their own gene expression data (generated using any platform) or gene sets of interest.

gene expression data analysis microarray functional annotation of genes gene co-expression independent component analysis aging mitochondrial unfolded protein response respiration hif-1 atfs-1 Nearly half of the predicted protein-coding genes in Caenorhabditis elegans lack a functional annotation based on direct experimental evidence (Lee et al. 2018). As a result, querying gene annotation databases, such as the Gene Ontology (GO) or curated pathways can fail to detect biologically meaningful signals in gene expression data (Ashburner et al. 2000;Fabregat et al. 2017;Kanehisa et al. 2012). An alternative approach to understanding gene function is to use information about gene expression. Gene-expression data can be used to define groups of genes that show similar patterns of expression, or co-variation, across many different conditions, which arise due to changes in genes' transcriptional activity and/or their post-transcriptional regulation. These groups are called transcriptional gene modules, with each module potentially representing a discrete biological phenomenon. Gene modules are routinely constructed when clustering algorithms are applied to gene-expression data and have been used successfully to identify gene regulatory mechanisms in a variety of contexts, from the yeast cell cycle (Wu et al. 2006) and sporulation (Wang et al. 2005) to human cancer cells (Teschendorff et al. 2007) to cognitive decline in patients with Alzheimer disease (Mostafavi et al. 2018). Furthermore, large compendia of data sampling diverse perturbations have been used to define fundamental gene-expression programs of entire organisms (Hughes et al. 2000, Engreitz et al. 2010, Zhou and Altman 2001. In C. elegans, the most recent effort to generate high-quality fundamental transcriptional modules is now almost two decades old (Kim et al. 2001). Co-expressed genes were grouped together into 43 groups (or "mountains") based on their correlation across 553 microarrays. However, the compendium did not contain all genes (e.g., over 30% of the microarrays only contained $11,000 of C. elegans' 20,470 protein-coding genes) and each gene was assigned exclusively to one group, although it is well-established that genes can participate in multiple processes (Oeckinghaus et al. 2011;Gerstein et al. 2012). The number of perturbations for which gene expression data are available has also increased substantially since 2001.
Here, we define 209 transcriptional gene modules in C. elegans using a heterogeneous compendium of 1386 microarrays and a method we call DEXICA, for Deep EXtraction Independent Component Analysis. DEXICA builds on prior implementations (Lee and Batzoglou 2003;Gong et al. 2007;Engreitz et al. 2010) of independent component analysis (ICA) for gene module extraction by maximizing the biological information content of the modules. It does so by varying data pre-processing methods and the number of extracted ICA components (modules) until the number of significantly enriched biological annotations in the module set is maximized. DEXICA also uses an artificial neural network to partition each independent component to determine which genes should be included in each module.
We show that the 209 DEXICA C. elegans modules capture gene expression patterns that correspond to biological processes; for example, responses to heat stress, xenobiotics and pathogenic bacteria, and to several individual tissues. Furthermore, data analysis in the module space correctly reveals biological processes that are missed by analyses of differentially expressed genes. We provide a user-friendly web interface, with tutorials, in which users can test which of the 209 gene modules are active in their datasets and find detailed information about each module that helps determine which biological process(es) they represent.
Finally, we explore whether gene modules can be used to improve existing gene annotations. We reason that an annotation shared among co-expressed genes is more likely to be relevant to their function than one that is not, and that annotations of co-expressed genes can provisionally be "transferred" onto their poorly annotated companions. We calculate what we call "module-weighted gene annotations" by weighting the association between a gene and an annotation by the degree to which the annotation appears predictive of the gene's module membership. We show that matrix-based analysis of module-weighted annotations is more sensitive and specific than common annotation enrichment tests. We provide a framework for using module-weighted annotations to detect significant GO terms and promoter oligonucleotides directly from expression data, and to identify novel GO terms conferred onto genes based on their module membership.
The first part of this paper, which describes the construction and biological validation of the C. elegans DEXICA modules, is directed to computational biologists, and the second to C. elegans biologists who may or may not have experience in computational biology. Readers simply wishing to apply this analytic method to specific gene lists should feel free to skip directly to the later sections but may also enjoy reading about the performance of the modules in our various validation test cases.

Compendium construction
To build the compendium of 1386 C. elegans Affymetrix arrays, we first downloaded all CEL files with the appropriate platform ID (GPL200) from the GEO database available on March 1, 2014, excluding those for which the organism was not C. elegans and the sample type was not RNA. We excluded arrays from experiments for which fewer than 8 hybridizations were performed in order to mitigate the effect that under-sampled conditions might have on predicted modules. We then performed a quality control step using the quality assessment functions provided in the simpleAffy (v2.40.0) R package (http://bioinformatics.picr.man.ac.uk/simpleaffy/), discarding arrays that did not meet the quality thresholds recommended in the simpleAffy documentation.
We generated expression values for probesets separately for each experiment (determined by GEO series IDs) using the RMA preprocessing procedure provided in the affy (v1.40.0) R package (Gautier et al. 2004), then used the bias (v0.0.5) R package (Eklund and Szallasi 2008) to remove intensity-dependent biases in expression levels. We then concatenated the expression matrices for each experiment into a single matrix. Next, we either performed between-experiment quantile normalization on the entire matrix using the limma (v3.18.13) R package (Smyth 2005), or omitted this step, depending on the preprocessing method to be tested. Finally, we scaled and centered the arrays, such that the mean of each column was zero and the standard deviation was 1.

Conducting ICA
To conduct ICA of the gene expression matrix, we used the fastICA (v1.2-0) R package (http://CRAN.R-project.org/package=fastICA) with default parameters except for the "method" parameter, which we set to "C" to increase computational speed, and the "row.norm" parameter, which we set to "TRUE" in order to balance the total compendium variance between genes with subtle changes in expression values and those with large changes in expression values.

Partitioning of independent components
To convert independent components to discrete sets of genes, we employed two methods. In the first, for each component, we assigned all genes with a weight .= 3 to the positive ("a") hemi-module, and all genes with a weight ,= -3 to the negative ("b") hemi-module. In the second, we created an artificial neural network using the neuralnet (v1.32) R package (http://CRAN.R-project.org/package= neuralnet) to predict positive and negative partitioning thresholds for each independent component, based on the component's skewness and kurtosis (see Supplemental Methods), then assigned genes whose weights exceeded these thresholds to the corresponding hemimodules.
Obtaining gene annotations and microarray data For module optimization, we obtained GO term and REACTOME pathway annotations for C. elegans genes using biomaRt (v2.18.0) R package (Durinck et al. 2009) and ensembl mart for data retrieval. To obtain genes' tissue annotations, we downloaded all available data from the GFP Worm database (http://gfpweb.aecom.yu.edu/) (Hunt-Newbury et al. 2007), which contains annotated expression patterns of promoter::GFP fusion constructs; in total, this dataset provided annotations for 1821 genes across 89 tissue types (n.b., we considered the same tissue in different development stages to be distinct tissue types). For Module Annotation Pages (Tool 3 on http://genemodules.org/), known tissues (anatomies) of gene expression were downloaded from the "expression patterns" database from Wormbase (WS275). This dataset included 5376 genes and 2438 anatomical terms. GO term enrichment was calculated using topGO(v.2.34.0) and celegans.db (v 3.2.3) R packages. To obtain fold changes for isp-1 mutants, we used data previously published by our group in which isp-1(qm150) mutants were compared to wild type controls (Cristina et al. 2009). To obtain fold changes for hif-1 mutants, we used the maanova (v1.33.2) R package (http://research.jax.org/faculty/churchill) and data previously published by Shen, et al. (Shen et al. 2005), to calculate the induced gene fold changes upon mutation of hif-1. All other microarray data were obtained directly from the authors of the original publications or from the GEO database (see Supplemental Methods).
Optimizing gene module prediction To optimize gene module prediction, we performed ICA with a variety of different data preprocessing options [e.g., the choice of preprocessing algorithm (RMA, GCRMA, PLIER, MAS 5.0), background, perfect match, bias correction, and normalization methods], and with a varied number of extracted components from 5 to 500 by increments of 5. For each parameter combination, we repeated ICA 5 times.
We tested the biological validity of the independent components generated by each ICA run by determining the number of annotations that were enriched in at least one hemi-module. We chose not to optimize based on the number of modules with at least one significant annotation, as tests using simulated data showed that this approach could lead to signals being split into multiple, less accurate representations (data not shown). We first calculated a p-value for the enrichment of genes associated with each annotation term in each hemi-module using the hypergeometric test. We then applied the Simes method (Simes 1986) for multiple hypothesis testing (alpha = 0.05) to the set of p-values for each annotation term. The Simes method is similar to the Benjamini-Hochberg method (Benjamini and Hochberg 1995) for controlling the false discovery rate, but differs in a way that makes it more appropriate here: it aims to answer the question, "Given a set of p-values, what is the likelihood that at least one null hypothesis is false?", while the Benjamini-Hochberg method asks, "What fraction of rejected null hypotheses are actually true (i.e., falsely discovered)?" Failure of the Simes test indicates that at least one null hypothesis is false at the specified alpha level. To verify the accuracy of our module quality statistics, we repeated all tests using module definition matrices in which gene IDs had been randomly shuffled.

Quantification of module activity in gene expression data
To project a data vector, x, such as a set of gene-expression fold changes, onto a set of gene modules, we used the scalar projection method, in which a mixing vector, a, is calculated from the dot product of the data vector and the unit vectors comprising the module definitions,Ŝ, as shown in equation 2: The resulting mixing vector, a, provides an indication of the weight of each module definition vector in the projected data, x. Note that because x is a vector of gene fold changes, this analysis can be applied to data generated using any RNA profiling platform (e.g. RNAseq, microarray). Projection of a data matrix, X, which generates a mixing matrix, A, was carried out using the same procedure.
To calculate signed variance explained (SVE), we calculated the relative variance explained (VE) for each module from a as follows, where n is the total number of modules: We then multiplied these values, which are strictly positive, by -1 in each case where a i ,0 to obtain SVE.
Generation of the Enrichment matrix, E First, ANN-based partitioning of the module definition matrix, S, was used to generate matrix S p . S p contains two gene sets per independent component (which we refer to as hemi-modules), for a total of 418. Using a matrix of known Boolean associations between genes and annotations, B, we calculated a hypergeometric probability for each annotation in each hemi-module. We used the frequency of genes bearing a particular annotation in the hemi-module, the frequency of such genes in the compendium, the number of genes in the hemimodule, and the number of genes not in the hemi-module as the q, m, k, and n input parameters, respectively, to the phyper() function of the stats (v3.0.3) R package (http://www.R-project.org/). We used these p-values to populate a matrix, E, with a row for each hemi-module and a column for each annotation. For underrepresented annotations, we entered the log(p-value) in the matrix, and for over-represented annotations we entered the -log(p-value).

Generation of module-weighted annotations
Our calculation of module-weighted annotations takes advantage of the fact that, in modules generated by ICA, prior to partitioning, each gene has a weight in each module. Given a score, or weight, for each annotation in each module, this allows genes to be associated with annotations via a matrix product calculation. In the E matrix (see above) highly positive values correspond to strongly enriched annotations in a hemi-module, highly negative values correspond to strongly depleted annotations. We transform the gene module matrix, S, into an unpartitioned hemi-module matrix, H, by concatenating it with a negative copy of itself column-wise: The product of this matrix with the E matrix produces matrix R, which relates genes to annotations: As a final step, we normalize the values in this matrix row-wise; i.e., separately for each annotation, by subtracting the mean and dividing by the standard deviation.

Analysis of GO terms based on gene weights
To test whether the ranking of GO terms based on gene weights (Table 1) was statistically significant, we constructed a list comprising all semantic words used in all GO terms, excluding words shorter than three letters and uninformative words (e.g., "the" and "for".) We then tested each word for bias toward appearing near the top or bottom of the ranked GO term list. In agreement with our initial observations, the most significantly top-biased words pertained to macromolecular complexes, such as "nucleosome", "cilium", and "ribosomal", and the most significant bottom-biased words pertained to cell signaling, such as "signal", "kinase", and "receptor". Many of the GO terms containing cell signaling words were generic in nature, e.g., protein kinase regulator activity, thus, our results may partially be explained by a lack of co-regulation among constituents of different signaling pathways. However, some specific cell signaling terms, e.g., Notch signaling pathway also appeared near the bottom of the ranked GO term list, suggesting that the genes annotated with such terms are either not strongly co-regulated at the gene expression level or that the biological conditions represented by the compendium did not perturb their expression enough to form modules with our method.

Analysis of expression data with moduleweighted annotations
To test whether a set of gene fold-changes was significantly enriched for specific annotations, given a module-weighted annotation matrix, R, we calculated the dot product of the data vector, x, comprising the set of gene fold-changes, and the R matrix: The resulting vector, a, provides an indication of the degree to which genes with strong weights for each annotation also have strong fold-changes. To generate p-values from these, we permuted the fold-change vector, x, 1000 times to create a background distribution for each annotation, which we then used to determine z-scores.

Data Availability
All data supporting the findings of this study are available within the paper and its supplemental materials. Dynamic access to the data, with tutorials, is also provided through a Shiny App graphical user interface at http://genemodules.org/. Specifically, 6 separate functionalities are provided: Tool 1: Determine which of the 209 modules defined here are active in the user's gene expression data (by inputting a list of gene expression fold changes) Tool 2: Using the partitioned 209 modules as gene sets, test whether a list of user-specified genes is enriched in any particular module Tool 3: Detailed description of each of the 209 C. elegans gene modules Tool 4: Visualize how genes assigned to a module change under a variety of conditions (conditions can be chosen from the 716 perturbations derived from our microarray compendium or provided by the user) Tool 5: Get module-weighted GO terms associated with a userspecified gene Tool 6: Analyze gene expression data using module-weighted GO terms Supplemental materials include Supplemental Discussion and Supplemental Methods, Supplemental References, eight Supplemental Figures and four Supplemental Files. File S1, "Partitioned S matrix", contains probe sets and genes that belong to each of the 209 DEXICA modules. File S2, "Module annotation summaries", provides a summary of information about each module -top enriched GO terms and top activating perturbation. File S3, "Finding experiments that activate similar modules" contains a comparison of module activity in all possible pairs of perturbations (contrasts) in the microarray compendium to facilitate identification of perturbations that activate similar modules. File S4, "Module-weighted GO terms", contains module-weighted GO annotations of each gene. Supplemental material available at figshare: https://doi.org/10.25387/ g3.12061656.
Code described in this study is available as an R package: DEXICA R package https://github.com/MPCary/DEXICA. Data to reproduce extraction of gene modules can be downloaded from https://github.com/MPCary/DEXDATA.Celegans.

Development of DEXICA and extraction of optimized gene modules
A large body of gene-expression data is publicly available (Barrett et al. 2011;Rustici et al. 2013) and has enabled computational prediction of gene modules (Kim et al. 2001;Segal et al. 2003a;Ihmels et al. 2004;Michoel et al. 2009;Engreitz et al. 2010). We refer to our method for going from a raw compendium of gene expression data to an optimized set of gene modules and a list of genes that belong to each module as DEXICA, for Deep EXtraction Independent Component Analysis (described below).
While several methods exist for defining gene modules, independent component analysis (ICA) generally outperforms clusteringbased approaches and principle component analysis in extracting biologically relevant signals from large datasets 15,22-24 . Advantages of ICA include its ability to deal well with high dimensional data and to generate modules that can share genes. Furthermore, ICA does not assume that latent signals in the data follow a Gaussian distribution, an important property for gene module prediction, as gene regulation signals appear to be primarily super-Gaussian (Purdom and Holmes 2005). For these reasons, we chose to use ICA for module construction.
Briefly, ICA is a blind source separation method that attempts to "unmix" a signal comprising additive subcomponents by separating it into statistically-independent source signals (Hyvärinen and Oja 2000). In the common notation, an n x m data matrix, X, is decomposed into two new matricesan n x k source matrix, S, and a k x m mixing matrix, A, where k is the number of independent components: In the context of gene-expression analysis, X is a matrix of m measurements (e.g., microarrays) of n genes, and k independent components are interpreted as gene modules. A indicates the weight of each module in each microarray and S indicates the relative level of inclusion of each gene in each module (Kong et al. 2008) (Figure 1a). Using simulated data, we found that module-prediction accuracy was highest when the number of extracted components matched the true number of modules ( Figure S1). Therefore, we sought to optimize module prediction by evaluating results based upon their biological information content, such as enrichment of Gene Ontology (GO) terms (Ashburner et al. 2000), REACTOME pathways (Fabregat et al. 2017), and tissue-specific expression (Hunt-Newbury et al. 2007). We applied ICA to a diverse compendium of 1386 C. elegans Affymetrix arrays obtained from the Gene Expression Omnibus (GEO) database (Barrett et al. 2011). We then compared results obtained from a wide variety of preprocessing methodologies and the total number of components extracted. Omitting between-experiment quantile normalization from the preprocessing procedure produced modules that were more annotation rich than those produced by a published implementation of ICA-based module extraction (Engreitz et al. 2010) (Figure 2a-c). Furthermore, annotation content of the modules was maximized when the number of gene modules (i.e., independent components) ranged from 191 to 226.
To enable functional evaluation of the modules, it is useful to summarize them as discrete gene sets. To this end, we partitioned each column of the S matrix into three sets of genes: one set consisting of genes excluded from the module, and two other sets consisting of genes implied to be regulated in opposite directions. We refer to these latter two sets as "hemi-modules" "a" and "b", one set consisting of genes with highly positive weights and the other consisting of genes with highly negative weights (signs assigned based on skewness) in the independent component. Thus, module genes regulated in one direction (up or down) are part of one hemi-module and genes regulated in the opposite direction (down or up) are part of the other hemi-module. The "a" hemi-module is derived from the more highly skewed side of the independent component, and is usually larger (i.e. contains more genes). While others have used a fixed-threshold approach to component partitioning (Lee and Batzoglou 2003;Chiappetta et al. 2004;Engreitz et al. 2010); for example, defining genes with weights exceeding +/2 3 standard deviations from the component mean to be "in" each hemi-module, we found that different individual modules showed maximum annotation enrichments at different thresholds, suggesting that a 'one-size-fits-all' approach to partitioning was sub-optimal. An alternative approach to partitioning that we attempted (described in Frigyesi et al. 2006) failed to converge in many cases (data not shown). Therefore, to increase partitioning accuracy, we trained a function to predict partitioning thresholds from the shape of component distributions. Because plots of training data revealed a complex solution surface, we decided to use an artificial neural network (ANN) to predict partitioning thresholds for each component from the skewness and kurtosis of its distribution. The output of the ANN, i.e., predicted partitioning thresholds, is shown in Figure S2a. Although using a fixed-threshold approach to module partitioning produced similar results qualitatively ( Figures S2b-d), it resulted in fewer significant annotations across the range of parameters tested than did ANNbased partitioning (P ,2.2E-16, Figure S2e  A matrix of gene expression data, X, is decomposed using independent component analysis (ICA) into a gene module definition matrix, S, and a matrix containing the weight of each module in each microarray, A. Rows of the S matrix indicate the relative degree of inclusion of a given gene in each module. (b) Annotation enrichment within modules is calculated using matrices B and S p . Known associations between genes and annotations are captured in a Boolean matrix B, where 1 indicates an association between a gene and an annotation and 0 indicates a lack of an association. S is partitioned into S p , where 0 indicates a gene's exclusion from a module, while -1 indicates a gene's assignment to the "a" hemi-module and 1 to the "b" hemi-module (a hemi-module comprises genes that have extreme weights and the same sign in a given column of the S matrix). Enrichment of genes associated with each annotation in each hemi-module is calculated using hypergeometric tests and log(p-values) (for under-represented annotations) or -log(p-values) (for overrepresented annotations) are recorded in a new matrix, E. (c) To generate module-weighted annotations, the S matrix is first transformed into a matrix, H, which indicates the weight of each gene in each hemi-module. H contains twice as many columns as S because there are two hemi-modules per module. The dot product between H and E results in the R matrix, where a row indicates the weighted association between a gene and each annotation. The color intensity in R indicates annotation weights that would result given the indicated expression values in X and the Boolean annotations in B.
Color saturation in all matrices except B (Boolean) and S p (trivalued) indicate relative numeric values, with least saturated colors (i.e., white) indicating highly negative values, and most saturated colors indicating highly positive values.

Figure 2
Optimization of gene modules. To determine the optimal preprocessing method and the optimal number of components (gene modules) to extract from a gene expression compendium of C. elegans microarray data, we calculated the number of Gene Ontology terms (a), C. elegans tissues (b), and REACTOME pathways (c) that were significantly enriched in at least one gene module. Black points show results from a compendium produced using a preprocessing procedure used by Engreitz et al. (Engreitz et al. 2010); red points show results for the best alternative preprocessing method that we tested. Black dashed lines indicate the point on the x-axis of each graph at which loess regression curves showed the greatest difference between red points and results from randomized controls (gray points). (d) The number of modules produced by DEXICA, by a different ICA-based method used by Engreitz et al. (Engreitz et al. 2010), and by C. elegans gene expression topomap generated by Kim et al.(Kim et al. 2001). The number and the percentage of total predicted modules that have significant enrichment for various annotations are shown. Error bars indicate s.d. between repeat runs of DEXICA or Engreitz et al. method. (e) Distribution of the number of probe sets partitioned to each of the 209 DEXICA modules. (f) Distribution of the number of different DEXICA modules to which a given probe set is partitioned. (g) Each of the 418 DEXICA hemi-modules was tested for enrichment of genes with different structural propertiespresence a long 39-UTR, transcription as part of an operon, and presence of multiple annotated splice variants.
Gene modules are expected to represent sets of genes that are co-regulated at the level of mRNA expression or stability. Therefore, DEXICA modules should be enriched for DNA and RNA regulatory sequences. To test this, we generated a list of potential regulatory oligonucleotide sequences (called 'words') by applying the Mobydick algorithm (Bussemaker et al. 2000) to the set of all predicted C. elegans promoter regions and, separately, to the set of all predicted C. elegans 39-UTRs. We then calculated the statistical significance of the over-or under-representation of genes bearing each word in each gene module. Across multiple runs with 209 components, the mean number of gene modules containing significant promoter words and 39-UTR words was 106.3 and 40.6, respectively, significantly greater than results produced by other module prediction methods we tested (P ,2.2E-16, Figure 2d).
Because the ICA algorithm employed by DEXICA, fastICA, converges to a final solution from a random starting point (Hyvärinen and Oja 2000), small differences typically exist in the output of different runs; these differences can be seen in the vertical spread of data points in Figures 2a-c, and in the error bars in Figure 2d. While others have reconciled such differences through a clustering approach applied to the output of numerous runs of the algorithm (so called "iterated ICA") (Frigyesi et al. 2006;Engreitz et al. 2010), when applied to the C. elegans Affymetrix compendium, many of the final components generated by this method were highly correlated to one another, indicating non-independence and potential redundancy among the components (data not shown). We therefore sought to choose a single, high quality, fastICA run output to use as predicted gene modules. Because we considered word enrichment the most unbiased measure of module quality (as it relies only on DNA sequence data), we chose as our final module set (File S1) the run from a set of 100 with the best combined rank of significant promoter and 39-UTR words (it ranked first in promoter words and third in 39-UTR words, Figure S3). The final median module size is 319 probe sets (Figure 2e) with half of all probe sets belonging to at least 3 modules (Figure 2f). Only 4% of all genes (729 genes, 1189 probe sets) do not belong to any module. These genes show significant enrichment for germline or embryonic expression.
Co-expressed genes often share common structural elements. For example, genes within operons are switched on together during recovery from growth-arrested states in C. elegans (Zaslaver et al. 2011) and 39-UTR length is associated with proliferation in cancer cells (Mayr and Bartel 2009). To further explore the information content of DEXICA-extracted modules, we tested each hemi-module for over-and under-enrichment of genes with long 39UTRs, for genes appearing in operons and for genes with multiple splice forms. Of the 418 hemi-modules, 65 contained a significant bias toward long 39-UTR genes and 58 contained a bias toward short 39-UTR genes (q , 0.1, threshold chosen based on randomized control trials, see below; Figure 2g). Twenty-one hemi-modules were significantly enriched and 205 hemi-modules were significantly depleted for operon genes, and 81 hemi-modules were enriched and 80 hemi-modules were depleted for genes with multiple splice variants (Figure 2g). Control tests performed on the same module set but with randomly scrambled gene IDs produced no significant modules below q = 0.1 for any of the gene properties we tested (data not shown). Therefore, DEXICA successfully groups genes with common structural features into modules, consistent with the known relationship between gene structure and expression.
Together, these results show that the 209 C. elegans gene modules extracted from a large microarray compendium are enriched for GO terms, tissue and pathway annotations as well as for potential regulatory DNA sequences and gene structural properties. DEXICA is available as an R package (https://github.com/MPCary/DEXICA). It provides tools to optimize ICA module extraction and partitioning based on annotation enrichment and can be applied to any gene expression compendium.

C. elegans DEXICA modules are biologically informative
We observed enrichment of functional gene annotations and oligonucleotide sequences in DEXICA-extracted modules with even the least optimal parameter settings (see Figure 2). To further test the biological significance of the final 209 modules and to begin annotating them, we constructed an alternate microarray compendium comprising not 1386 individual arrays, as in the original compendium, but rather the gene fold changes arising from contrasting experimental and control samples in the same experiment. Projecting the resulting 716-column matrix (one for each contrast) into the space defined by our gene modules allowed us to see which experimental perturbations activate or inhibit which modules.
We compared the most enriched GO terms in each module to the most strongly activating or inhibiting experimental perturbations and observed that in several cases these were in obvious agreement (File S2 and Tool 3 on http://genemodules.org/). For example, the strongest activation of m73 (module 73) is achieved by an embryonic development time course experiment. Consistent with this, the top GO terms enriched within genes that comprise m73 describe cell division and the cell cycle. Module 153 is strongly activated by a comparison of L3 lethargus (a larval period prior to molting) to active L3 larvae and, accordingly, this module is enriched for GO terms that describe cuticle remodeling. Similarly, m200 is activated in response to the pathogen P. aeruginosa and the top GO terms enriched within m200 genes include "defense response to Gram-negative bacteria" and "innate immune response".
DEXICA modules can also capture gene expression patterns that define specific tissues. For example, m23 is strongly activated by GFPbased enrichment of neurons (GSE8004). The top GO term associated with m23 is "neuropeptide signaling pathway" and genes in this module have highly significant associations with neuronal anatomy terms. Similarly, m144 is strongly activated by genes specific to embryonic muscle cells (GSE8462) and is enriched for GO terms such as "striated muscle dense body" and anatomies such as "body wall musculature". Together, these results indicate that DEXICA modules represent biologically meaningful sets of genes.
It is very important to note that a lack of an obvious agreement between the nature of a perturbation that strongly activates a module and the top GO terms enriched in that module does not necessarily indicate that a module is biologically meaningless. On the contrary, this apparent disagreement can be a powerful tool for understanding complex transcriptional responses to perturbations. For example, starvation of L1 larvae induces a developmental arrest and has widespread effects on metabolism and energy allocation (Baugh 2013). Module 51 is strongly activated by L1 starvation and is enriched for GO terms that describe ribosome biogenesis (e.g., "ribosome biogenesis", "nucleolus"), autophagy (e.g., "autophagosome maturation") and response to starvation (e.g., "cellular response to nitrogen starvation"). Thus, this module represents several coordinated processes that become activated in response to larval starvation. Obvious agreement between a perturbation and GO terms would also be absent if a process had not previously been associated with a particular condition, making modules useful for hypothesis generation. For example, a wild C. elegans isolate, JU1580, shows strong activity of m55 relative to the laboratory C. elegans strain, N2, and genes within m55 are enriched for the "innate immune response" GO term. This leads to a hypothesis that the immune systems of JU1580 and N2 have different levels of basal activity.
Application of DEXICA modules to improve gene annotations Because gene annotations can be incomplete and biased, knowledge about which genes tend to be co-expressed in an organism can help inform which annotations are more likely to be biologically relevant and help infer annotations of orphan genes based on annotations of other module members (see Supplemental information for additional discussion). To this end, we devised a calculation of scores, which we call "module-weighted annotations", based both on how much a given annotation is enriched in each module (matrix E, Figure 1b-c) and the degree to which a given gene belongs to that same module (matrix H, Figure 1c).

Module-weighted GO terms:
To test the utility of module-weighted annotations, we first examined whether they generally recapitulate traditional (Boolean) annotations. If so, then genes traditionally associated with each annotation should have larger weights for those annotations (when normalization is omitted from their calculation) than do other genes (two-sample KS test, alpha level = 0.05). For this analysis we used GO terms as annotations and restricted the set to terms with at least 15 annotated genes in order to ensure robust signals; this set comprised 1651 GO terms. In 98.6% (1628/1651) of cases, genes associated with each term had significantly larger module-weighted annotations than did other genes, indicating that module-weighted GO annotations do recapitulate Boolean annotations.
We ranked each GO term by the weight of its most strongly associated gene (Table 1). GO terms with the most highly weighted genes were "ribosome" (CC) and "structural constituent of ribosome" (MF). Consistent with this finding, genes involved in ribosome biogenesis are known to be transcriptionally co-regulated (Wade et al. 2006). On the other hand, GO terms associated with signal transduction and kinase activity showed the lowest gene weights. These results are statistically significant (see Supplemental Methods methods) and show that some types of gene groupings (e.g., genes encoding kinases) often used in the analysis of gene expression data may not actually represent functions that are coordinately regulated at the transcript level. Therefore, we suggest that researchers use caution when interpreting a result that certain annotations, like "kinases", appear to be enriched in a transcriptomics experiment.
A sensitive test will tend to give similar results despite small amounts of random noise being added to the input data. To test the sensitivity of a module-weighted annotation-based enrichment test, we obtained gene fold changes for the 5 most recent (at the time of query) C. elegans Affymetrix experiments deposited to the GEO database. These experiments were not included in the data used to construct the modules. We then added varying amounts of Gaussian noise to the fold changes and for each level of noise, we calculated enrichment z-scores for each GO term using three different methods: the Kolmogorov-Smirnov (KS) test, the t-test (n.b., one-sample and two-sample t-tests produced nearly identical results, data not shown), and scalar projection of the gene fold changes onto a module-weighted GO term matrix. We then compared these results to those obtained by each method when no noise was added (Figure 3a). Dissimilarity to the initial results (zero added noise) increased rapidly with added noise for both the KS test and t-test, while the projection-based results were similar (r . 0.75) at even the highest noise levels tested (5 standard deviations), suggesting that the projection-based test is more sensitive than both the KS and t-tests.
A specific test will tend to show dissimilar results when given dissimilar inputs. To test the specificity of module-weighted annotation analysis, we selected the 100 most dissimilar experiments (correlation of gene fold changes near zero) from a set of 188,805 comparisons. We compared the results of projecting the gene fold changes from each experiment onto the module-weighted annotation matrix to the results obtained from the KS and t-tests and found that the experiments showed the weakest similarity in the significance levels of annotations when using the projection method (P = 5.2e-13, Figure 3b). These results show that annotation enrichment analysis using module-weighted annotations may provide more reliable biological insights than gene set enrichment analyses that rely on the KS (GSEA (Subramanian et al. 2005)) or t-tests (PAGE (Kim and Volsky 2005), GAGE (Luo et al. 2009)). Module-weighted GO terms for each gene (matrix R, Figure 1c) are provided in File S4 and their significance in a query gene expression dataset can be tested using Tool 6 on http://genemodules.org/.
Module-weighted promoter words: While any Boolean gene annotation may be converted into a module-weighted annotation, modulebased weights seem particularly well suited for describing regulatory sequences, such as putative transcription factor and microRNA binding sites. To this end, we generated weighted annotations for each of the 5230 words in the promoter word dictionary we constructed using the Mobydick algorithm (described above and in Methods).
To validate predicted regulatory word weights, we searched the literature, the JASPAR database of transcription factor binding profiles (Mathelier et al. 2016), and the GEO database for C. elegans transcription factors with both an experimentally characterized DNA-binding profile and, separately, a microarray experiment that measured gene expression in a loss-of-function mutant of the transcription factor gene. This search yielded 6 transcription factors: daf-12, daf-16, hif-1, hlh-30, lin-14, and nhr-23. We then projected the loss-of-function microarray experiment data (positively and negatively changing genes separately) onto the module-weighted word matrix, R (Figure 1c), and calculated z-scores for each word. Finally, we compared the top-scoring words to the DNA-binding profiles of the respective transcription factors. If the predicted promoter-word weights are accurate, then words that resemble the binding profile should score highly in this analysis.
For hif-1 and nhr-23, the most significantly enriched words in the positively and negatively changing genes, respectively, matched the canonical binding sites (Table 2). A word matching the hlh-30 binding site scored 6 th overall among the up-regulated genes, and for daf-12, four of the top 20 words for the up-regulated genes contained GAACT or AACTT, which partially matched the reverse compliment of a reported daf-12 binding half-site, AGTTCA (Shostak et al. 2004). In the daf-16 data set, several words matching the so-called "daf-16 associated element" (DAE) (Zhang et al. 2013) scored highly. However, none of the four words matching the canonical daf-16 binding site, T(G/A)TTTAC were among the words comprising our Mobydick promoter-word dictionary, precluding these from being represented in the analysis. The canonical binding site for the final transcription factor, lin-14, is GAAC, but like the canonical daf-16 binding site, neither this word nor its reverse compliment was present in the promoter word dictionary, precluding it from representation. Taken together, these results suggest that module weighted regulatory sequences can be used to determine important regulatory sites in a gene expression experiment, and further validate our method for calculating weighted annotations.
Finally, we tested whether module-weighted words could reveal activity of the transcription factor HIF-1, previously shown to be required for longevity of the isp-1 mitochondrial respiration mutants (Lee et al. 2010). We calculated promoter word z-scores for the isp-1 microarray data set and compared the results to those for hif-1. As predicted, we observed a very strong anti-correlation between the promoter word z-scores for isp-1 mutants and those for hif-1 mutants (Figure 3c, R = -0.581, P , 2.2E-16) and four of the six most active words matched the canonical HIF-1 binding site [(A/G)CGTG; underlined in Figure 3c]. These results further support the utility of module-weighted promoter oligonucleotides for identifying biologically relevant regulatory sequences directly from gene expression data.

Informative descriptions and annotations of DEXICA modules
Gene co-expression modules are useful for interpreting gene expression data because they often represent transcriptional signatures of biological processes (Segal et al. 2003b). Therefore, detailed information about modules that are active in a gene-expression experiment would be of primary interest to an investigator. We showed above that C. elegans gene modules constructed using DEXICA are biologically informative. As a comprehensive resource, we created a Module Annotation Page for each of the 209 modules (see Data availability, Tool 3). Each page shows the module's significantly enriched GO terms (excerpted in File S2), tissues of known n■ Table 1 GO categories with strongest and weakest gene weights. The table shows the top 25 and the bottom 5 GO categories, ranked in decreasing order of the weight of their most strongly associated gene prior to normalization. Similar GO categories are grouped together, and the genes with the highest weights for each group are also shown. rpl-(ribosomal protein, large subunit) and rps-(ribosomal protein, small subunit) genes encode ribosomal proteins, and his-(histone) genes encode nucleosome components expression, enriched promoter and 3'-UTR oligonucleotide words, transcription-factor binding sites predicted using HOMER (Heinz et al. 2010) and component genes (each linked to a WormBase description). The component genes can be interrogated further for enrichment of additional gene annotations or regulatory motifs [e.g.
using RSAT (Nguyen et al. 2018)] to gain a deeper understanding of a module's function. A limitation of using GO terms and other pre-existing annotations to interpret gene modules is that some cellular activities are poorly annotated (see UPR mt example in section 1 below). For this reason, an Figure 3 Performance of moduleweighted annotation-based tests. (a) To assess sensitivity, Gaussian noise was added to gene expression data from 5 recent C. elegans Affymetrix experiments not included in the compendium used to train the modules.
[The standard deviation of the noise distribution (m = 0) was varied from 0.5x to 5x the standard deviation of the gene fold change distribution.] At each noise level, z-scores for each GO term (based on 100 random permutations of the gene IDs in the input data) were calculated using three different methods: KS -Kolmogorov-Smirnov test; t-testtwo-sample t-test (onesample test gave highly similar results, data not shown); projectionprojection of the gene fold changes into the space defined by the module-weighted annotation matrix, R. When conducting KS and t-test, fold changes of genes assigned to each GO term were compared to fold changes of genes not assigned to that term. Spearman correlation coefficients were calculated between Z-scores at each noise level and Z-scores without any noise added. (b) To assess specificity, 100 highly dissimilar pairs of experiments (Spearman correlation, r, of gene fold changes near 0) were selected from a set of 188,805 pairs generated using published microarray data. For each contrast belonging to a pair, we determined Z-scores for GO annotations using the three methods and calculated the rank correlation (rho) of the absolute value of these Z-scores between pair members. The center of the box represents the median value and whiskers extend to the most extreme data point that is not further than 1.5 times the IQR from the box. (c) Gene fold changes in isp-1 and hif-1 mutants were projected into promoter word space using moduleweighted promoter words (see Methods). The 10 points furthest from the origin are highlighted with colored circles in the figure (orange = positive SVE for hif-1, blue = negative SVE for hif-1), and their words corresponding to these points are shown to the right of the figure. Four of the six words highlighted in orange contain full or partial matches to the canonical hif-1 binding site, (A/G) CGTG (underlined). additional strategy to understanding what a given module represents is to examine its activity under a variety of conditions. To facilitate this, in the Module Annotation Pages we have also provided a ranked list of perturbations (from the 716 we generated) that activate each module significantly (also see File S2). While it is unlikely that a typical gene module can be described completely using a single semantic annotation due to the complexity of gene regulatory networks, we suggest processes that each module likely captures based on our comparison of module activity under several conditions and our examination of enriched GO terms (File S2). In a typical workflow, once the active modules in a gene expression experiment are identified by a researcher, we recommend that the researcher consult Module Annotation Pages and File S2 to infer the biological process(es) described by the modules of interest.

Guide to using DEXICA modules
Gene modules can uncover processes that are poorly annotated and therefore missed by conventional analyses: Mutations in components of the electron transport chain reduce respiration, slow development and reproduction and increase lifespan in C. elegans (Dancy et al. 2014). The mitochondrial unfolded protein response (UPR mt ) is induced in response to a stoichiometric imbalance between nuclear and mitochondrial proteins within mitochondria, and this process is known to be active in long-lived respiration mutants Durieux et al. 2011). However, analysis of significantly differentially expressed genes in the isp-1 respiration mutant, either using conventional GO term-or KEGG pathway-overrepresentation tests or using a more recent tool designed to query more comprehensive annotations of C.elegans genes, WormCat (Holdorf et al. 2020), failed to identify UPR mt or, for that matter, any process related to the mitochondria (Figure 4a). We asked whether application of DEXICA modules could successfully identify a UPR mt gene expression signature in isp-1 mutants.
As a first step of gene expression analysis using DEXICA modules, we determine a fold change of every detected gene between isp-1 mutants and wild-type animals. This calculation is done without imposing any cut offs on the data (e.g., p-values or effect sizes). We then project the resulting vector onto the module definition matrix (S, see Figure 1), which can be done by using Tool 1 of the online user interface (http://genemodules.org/). Three modules -m47, m66 and m169 -displayed the highest activity (Figure 4b). Module activity is represented as Signed Variance Explained (SVE), where + orindicates direction of activity (useful when comparing two or more experiments), and the absolute value of SVE represents the relative proportion of variance in a gene expression experiment explained by a particular module. The sum of absolute values of SVE for all modules is 1.
By consulting Module Annotation Pages (see section "Annotation of DEXICA modules" and Tool 3 on http://genemodules.org/), we determine that the same three modules are strongly active in other respiratory-chain mutants, but also that the most strongly activating non-respiration perturbation of m47 is a mutation in spg-7, a mitochondrial protein quality-control protease, disruption of which is known to induce UPR mt (Pellegrino et al. 2014) (Figure 4b a plot like this can be generated by calculating module activity in each experiment using Tool 1 and plotting the resulting SVE values against each other). This finding suggests that m47 may encompass UPR mt genes. If true, then m47 should be active in animals with constitutive activity of the transcriptional UPR mt regulator, ATFS-1. To test this prediction, we calculated module activity in atfs-1 gain-of-function mutants (this perturbation was not part of the original compendium, GEO Accession number GSE73669). Indeed, induction of the UPR mt transcriptional response in otherwise healthy animals using this mutation strongly and specifically induced activity of m47 ( Figure  4c). Furthermore, genes that belong to m47 showed concordant expression in the isp-1 respiration mutant, in response to UPR mt induction by disruption of spg-7, and in response to constitutive activity of ATFS-1 in normal animals (Figure 4c, d). As expected, mutation of atfs-1 prevented these changes in animals with an n■ Table 2 Top scoring promoter words for transcription factor perturbation experiments. The table shows the top three most significant words (those with the most extreme z-scores), calculated using module-weighted promoter words, for each transcription factor loss-offunction microarray experiment. If a full or partial match (4 or more bases) to the canonical binding site of the factor does not occur within the top 3 ranking words, the most significant such match is also shown. Bold letters indicate matching positions to canonical binding sites  Figure 4 Analysis of module activity in isp-1 respiration mutants reveals activity of the key mitochondrial unfolded protein response (UPR mt ) transcription factor, ATFS-1. (a) Conventional annotation analyses of genes that are differentially expressed in isp-1(qm150) respiration mutants. GO term-overrepresentation was determined using GOrilla (Eden et al. 2009) and KEGG pathway-overrepresentation was determined using induced UPR mt (spg-7 RNAi) (column 4 Figure 4d). Moreover, two chaperone genes known to be induced during UPR mt (hsp-6 and dnj-10) are part of m47. Taken together, these data strongly suggest that m47 represents genes induced by ATFS-1 during the UPR mt . This is an example of how identifying active gene modules in a mutant of interest and finding other conditions that activate the same modules can reveal biologically informative gene signatures. Analysis of module activity, but not GO term or KEGG pathway enrichment analyses, was able to identify correctly a key biological process using gene expression data from isp-1 mutants. Why did GO term enrichment analysis fail to recover the "mitochondrial unfolded protein response" term (GO:0034514)? To our surprise, there are only eight C. elegans genes annotated with this GO term (haf-1, ubl-5, gcn-2, atfs-1, clpp-1, dve-1, hsp-6 and hsp-60). Among these, only hsp-6 and hsp-60 are induced during UPR mt , whereas the others are genes needed for activation of UPR mt . In the KEGG pathway database, UPR mt is not included as a discrete entry, but is part of the "Longevity regulating pathwayworm" (map04212) entry, and similar to GO, primarily includes inducers rather than mediators of the UPR mt . These examples illustrate how incomplete annotations and/or overly general groupings (e.g., containing both activators and mediators) can result in a failure of standard annotation enrichment methods to detect biological signals in gene expression data.
Nargund et al. have defined a set of ATFS-1-targeted genes based on up-regulation of these genes with or without atfs-1 and in the absence or presence of mitochondrial stress ). While module analysis was able to reveal UPR mt without reliance on any pre-existing gene annotations or pre-defined gene sets, we wondered whether GSEA (Subramanian et al. 2005) analysis of isp-1 gene expression data using this gene set would have been able to detect ATFS-1 activity. The Nargund et al. ATFS-1 gene set showed enrichment within isp-1 genes, but this enrichment was not statistically significant ( Figure S4). In contrast, when we tested enrichment of the ATFS-1 gene set within modules (enabled by Tool 2 on http:// genemodules.org/), we found a highly significant enrichment within m47 (Figure 4e). These results show that given a set of functionally related genes, testing enrichment of that set within modules can be more informative than testing enrichment within a ranked list of gene changes, likely because modules comprise groups of genes that are functionally related.
Gene modules can uncover gene signatures despite a lack of significant differential expression: Reduction-of-function isp-1 mutations extend lifespan in a manner dependent on the hypoxia inducible factor 1 (HIF-1) (Lee et al. 2010), but, unexpectedly, we found that the overlap between the significant genes obtained by comparing isp-1 or hif-1 mutants to wild type was not statistically significant (Χ 2 test p-value = 0.17; Figure S5a). The other highly active modules in isp-1 mutants besides m47 are m66 and m169. We wondered if these modules represent genes regulated by HIF-1. Indeed, when we compared gene module activity in hif-1 and isp-1 mutants, we found a strong anti-correlation (Pearson correlation between SVE = -0.730, P = 4.7E-36; Figure S5b) driven by m66 and m169. The negative correlation is consistent with the finding that the life extension observed when isp-1 activity is reduced requires activity of HIF-1. The idea that HIF-1 directly regulates transcription of genes in m66 and m169 is further supported by the fact that the canonical HIF-1 binding site [(A/G)CGTG] is the top oligonucleotide sequence enriched in the promoters of the genes comprising these modules (qvalues 8.13e-105 and 1.47e-28, respectively). The similarity between module activity in isp-1 and hif-1 datasets, despite a lack of similarity among their most differentially expressed genes, suggests that the role of HIF-1 in regulating the lifespan of isp-1 mutants may be to instigate small but coordinated expression changes in many genes, most of which fail significance tests for differential expression in one or both datasets. These results demonstrate that gene modules are sensitive toward identification of transcriptional signatures and therefore can be useful for analysis of datasets in which few gene changes reach statistical significance.
Gene modules as a hypothesis generation tool: As shown above, the correlation between the module activities of hif-1 and isp-1 was substantial (r = -0.730), and therefore served as a strong validation of the module approach to uncovering functionally significant processes. To determine how often two sets of module activities generated from different experiments could be expected to show this degree of similarity, we determined module activity correlations for all possible pairs of the 716 experimental contrasts, excluding pairs in which both contrasts originated from the same experiment (i.e., from the same GEO series). This produced 188,805 contrast pairs, 13,376 (7.08%) of which showed a statistically significant correlation (Holm corrected p-value ,0.05). As expected, the highest-ranking pairs were the same experiment performed by different labs or at different times. The strength of the hif-1 and isp-1 correlation would have fallen within the top 1%, had those experiments been part of the contrast set. Interestingly, had we observed the similarity between the hif-1 and isp-1 projections without any prior genetic data, we might have hypothesized a role for hif-1 in isp-1 mutants before such a role was discovered from genetic screening (Lee et al. 2010). Therefore, to give the reader a chance to find other unexpected correlations between perturbations (i.e., gene expression changes that activate similar modules but that are generated by different experiments), we include the full set of contrast comparisons in File S3. We hope these comparisons will prove useful for hypothesis generation.
Guide to using module-weighted annotations To aid in functional interpretation of gene expression data despite sparse annotations of many C. elegans genes, we have developed a process of transferring gene annotations based on the degree of gene co-expression to create "module-weighted gene annotations" (see description above). This approach provisionally transfers annotations, such as GO terms, that are enriched in a module to all gene members of that module. This can be useful in at least two ways: for studying individual genes and for identifying functional categories within gene expression data.
For studying individual genes: if a query gene has a strong module-weighted association to a GO term with which it is not traditionally associated, it means that this gene is co-expressed with other genes that are traditionally associated with the GO term. For example, by applying Tool 5 (http://genemodules.org/) to the small ribosomal subunit S16 (rps-16), we find that most of the moduleweighed GO terms associated with this gene have something to do with the ribosome or translation, indicating that rps-16 is tightly co-expressed with other genes involved in ribosome biogenesis and not much else. On the other hand, while sod-3, a superoxide dismutase, does have significant weights for GO terms with which it is traditionally associated (e.g., oxidoreductase activity, superoxide metabolic process and response to superoxide), "catalase activity" ranks more highly for sod-3. This indicates that sod-3 is co-expressed with genes annotated as catalases. While it is unlikely that sod-3 has a novel catalase activity, it is more likely that expression of sod-3, a superoxide dismutase, is coordinated with expression of catalases, since hydrogen peroxide produced by sod-3 is further degraded by catalases to avoid damage to the cell. Some of the other top-ranking module-weighted GO terms for sod-3 describe dioxygenase activity. Similarly, this may indicate that dioxygenases generate superoxide radicals, and therefore increased expression of these enzymes is typically correlated with increased expression of a superoxide dismutase. Therefore, analysis of module-weighted annotations of individual genes can help form hypotheses about novel gene functions and transcriptional co-regulation of distinct biological processes.
The second application of module-weighted annotations is similar to a common analysis scheme whereby over-represented annotations (e.g., GO terms) are identified within a list of significantly differentially expressed genes. The key difference is that traditional annotations have been extended using information about gene co-expression (i.e., gene modules). Furthermore, this approach leverages all expression data, even genes that do not change in a significant way. Instead of calculating enrichment p-values, Z-scores are calculated based on the strength of the module-weighted association between each gene and each annotation and the magnitude of that gene's fold-change in the list of gene fold-changes. This analysis scheme is more selective and sensitive toward identifying functional categories of genes that change in response to a treatment than traditional over-representation analyses (see section "Application of DEXICA modules to improve gene annotations" above). Tool 6 on http://genemodules.org/ accepts a list of gene fold-changes and returns a table of module-weighted GO terms ranked by their predicted activity.

DISCUSSION
We have captured much of the transcriptional wiring of C. elegans by extracting gene co-expression modules from a large compendium of gene expression data. These 209 modules represent RNA-level signatures of diverse biological processes that can occur in C. elegans and, along with the tools we provide, can be used as a resource for analyzing gene expression data, or in fact to interrogate gene lists of any sort. Because genes are grouped into modules based purely on how they behave in experimental assays, and because modules encompass both previously-annotated and unannotated genes alike, modules can reveal signals within transcriptomic data that otherwise would be missed due to incomplete knowledge of gene function, subtle gene expression changes or noisy data.
Experimentally, gene-expression modules can be used to deconvolve complex phenomena into subsets of co-regulated genes, genes that likely act together to mediate a specific process. The modules can be annotated extensively, as we have done, and these annotations can be applied provisionally to all genes in the module. Regulatory factors associated with specific annotations (like 59 or 39 oligonucleotide words) can be implicated as well, and functional links can be revealed between dissimilar conditions that activate the same modules (e.g. see File S3).
ICA has been applied to the prediction of gene modules before, but we could find no examples in the literature of optimizing the process using biological metrics in the manner that we describe. Combined with the improved ability to partition independent components provided by our artificial neural network approach, we expect that DEXICA will be useful for constructing gene modules for other organisms. The DEXICA software and our C. elegans modules and data are freely available as R packages online (see Data availability).
While we have taken steps to maximize module prediction accuracy for the microarray compendium we assembled, many additional gene modules may exist in C. elegans that were not perturbed sufficiently in the samples comprising the compendium to be detected. These gene modules would remain hidden. As new areas of research are explored and new experiments are published, however, new gene modules will be discovered. For example, most of the gene expression data in our compendium was collected using whole animals. Data collected from isolated cells or tissues could help produce modules that are active in a relatively small number of cells. The DEXICA package can be used to create new and improved gene modules using compendia with expanded information content.
The impetus for constructing gene modules in C. elegans was to create gene groupings that do not rely on existing annotations, because annotations of many genes are missing or incomplete. We then wondered whether numeric scores based on gene co-expression could actually improve the existing gene annotations, leading us to develop the concept of module-weighted annotations. By weighting an association between each gene and each annotation by the degree to which that annotation appears to predict gene modularity, annotations that are shared by co-expressed genes are "boosted" and those that are not are diminished. It is important to note, however, that relevance scores between genes and annotations could be calculated using other metrics of gene behavior as well. For example, a gene might be weakly associated with a term in the context of gene expression but strongly associated with it in the context of physical protein interactions.
We have found that analysis of module-weighted GO terms is less sensitive to noise than are typical statistical over-representation tests. Furthermore, because module-weighted annotations incorporate information about gene expression, they effectively model the genegene correlation structure of the system. This is useful because typical over-representation tests do not perform well with gene sets that have a high level of gene-gene correlation [i.e., annotations assigned to genes that are strongly co-expressed are more likely to be significant (Goeman and Buhlmann 2007;Gatti et al. 2010;Tamayo et al. 2016)]. GSEA, for example, deals with gene-gene correlation issue using permutation. We show that module-weighted GO terms produce significantly fewer false positives than do over-representation tests (see Figure 3b). Therefore, we think that module-weighted annotations are a promising new way to address the gene-gene correlation problem and are working to develop it further for annotation enrichment analysis.
Module-weighted annotations may be especially useful for promoter/39-UTR word analysis. Projection of gene expression changes onto the module-weighted word matrix allows identification of potential regulatory sequences directly from a list of gene fold changes, bypassing many steps required by traditional regulatory sequence detection (sequence retrieval, repeat masking, over-representation analysis, and background correction). More importantly, this analysis is more likely to yield true positive results because it makes use of a large amount of historical data (via the modules) to filter out sequences with no apparent regulatory function.
We used module-weighted word analysis to analyze data from six transcription factor perturbation microarray experiments. Four of the six produced high scoring promoter words that closely matched the known DNA binding sites of the corresponding transcription factors. For the two that did not, daf-16 and lin-14, words that exactly matched the factors' canonical binding sites were not present in the promoter word dictionary generated by the Mobydick algorithm, and results for these factors could be poor for this reason. Interestingly, the second highest scoring word for the genes down-regulated upon daf-16 perturbation was GGAAG, and this word occurs twice more as a substring among the top 20 scoring words. This sequence is a partial match to an alternative daf-16 binding site reported in hookworm (Gao et al. 2010), G(A/G)(C/G)A(A/T)G, suggesting that this site may be functional in C. elegans as well.
Finally, because annotations shared among a significant fraction of the genes in a module are "transferred" to all genes in the module, module-weighted annotations can be used provisionally to infer novel functions for genes, which is especially useful for studying poorly annotated genes.