iModulonDB: a knowledgebase of microbial transcriptional regulation derived from machine learning

Abstract Independent component analysis (ICA) of bacterial transcriptomes has emerged as a powerful tool for obtaining co-regulated, independently-modulated gene sets (iModulons), inferring their activities across a range of conditions, and enabling their association to known genetic regulators. By grouping and analyzing genes based on observations from big data alone, iModulons can provide a novel perspective into how the composition of the transcriptome adapts to environmental conditions. Here, we present iModulonDB (imodulondb.org), a knowledgebase of prokaryotic transcriptional regulation computed from high-quality transcriptomic datasets using ICA. Users select an organism from the home page and then search or browse the curated iModulons that make up its transcriptome. Each iModulon and gene has its own interactive dashboard, featuring plots and tables with clickable, hoverable, and downloadable features. This site enhances research by presenting scientists of all backgrounds with co-expressed gene sets and their activity levels, which lead to improved understanding of regulator-gene relationships, discovery of transcription factors, and the elucidation of unexpected relationships between conditions and genetic regulatory activity. The current release of iModulonDB covers three organisms (Escherichia coli, Staphylococcus aureus and Bacillus subtilis) with 204 iModulons, and can be expanded to cover many additional organisms.


INTRODUCTION
The transcriptional regulatory network (TRN) governs gene expression in response to environmental stimuli, which is of fundamental interest in biology. The TRN functions by employing condition-responsive regulators, such as transcription factors (TFs), to regulate the transcription of genes. Understanding these regulators and their effects in bacteria informs many important applications, including bioproduction (1) and antibiotic resistance (2). There are several organism-specific databases of gene-regulator relationships (3)(4)(5), but knowledge of regulator binding alone is insufficient to explain the complex responses reflected in transcriptomic datasets (6,7). Thus, there is a need for datadriven approaches to TRN elucidation, which can detect the most important transcriptional signals in a gene expression dataset, identify their major gene constituents, and quantify their condition-dependent activity levels.
With the growing availability of bacterial transcriptomes, machine learning is emerging as a powerful tool for TRN elucidation. The falling price of RNA sequencing has led to a rapid growth in online transcriptomic databases (8,9), creating a strong need for the development of analytical tools that can harness its scale to transform raw data into biologically meaningful information (10). For transcriptomic data, this knowledge comes in the form of (i) identifying which regulons are active in each condition probed in the dataset, (ii) generating hypotheses about gene function and regulation and (iii) revealing novel relationships and patterns in bacterial lifestyles. In comparison, traditional methods such as chromatin immunoprecipitation (ChIP) assays (11), can be time-consuming and expensive, making them cumbersome for high-throughput discovery or hypothesis generation. They also do not yield the conditionspecific strength of binding, which can be inferred by machine learning. Another strength of data-driven approaches is that they can be applied to any organism, regardless of prior information. Ultimately, a comprehensive, quantitative TRN would be the result of this pursuit.
Independent component analysis (ICA) addresses the goals described above. It is a blind source separation algorithm that identifies statistically independent signals underlying a dataset, and decomposes the original matrix into source (or module, M) and activity (A) matrices (12,13). The module matrix M defines relationships between genes and the identified signals, while the activity matrix A describes the intensity of each signal in each sample. A comparison of 42 TRN inference methods demonstrated that ICA was the best at recovering known gene modules (14). Additional studies found that ICA-derived gene modules were robust across datasets (15). It has been used for a variety of organisms in the past, especially yeast and human cancer cells (16)(17)(18)(19)(20).
We recently applied ICA to high quality transcriptomic datasets for three species of bacteria: Escherichia coli (21), Staphylococcus aureus (22) and Bacillus subtilis (23). In our work with E. coli, we termed the independent signals 'iModulons', for independently modulated gene sets. We used the same codebase (www.github.com/SBRG/precise-db) to generate all three transcriptome decompositions. We then assigned categories, functions, and regulators to each iModulon. The regulator assignments were based on existing knowledge of transcription factor binding sites (3)(4)(5)22), or, in some cases, a boolean combination of several regulons. By comparing our decompositions to known regulons described in other databases, we were able to identify potentially mislabeled or previously unknown TF-gene associations and verify them with ChIP-exo binding profiles. For example, we discovered five new binding sites for MetJ in E. coli (21). The observed iModulon activity levels also provided additional evidence for newly discovered regulatory roles or point to new hypotheses. For instance, we observed that the iron chelator pulcherrimin was active in stationary phase signaling in B. subtilis (23). This observation was recently externally validated (24). More broadly, each decomposition explained 65-80% of the variance in the transcriptomic data sets, indicating that the functions and regulators identified captured most of the transcriptional functions of the TRNs under the conditions where the data was obtained.
Several other studies have utilized iModulons to obtain valuable results. Our original E. coli study led to the discovery of a regulon putatively controlled by the uncharacterized TF ydhC (21), which has since been characterized and renamed (25). iModulons were also used to characterize the function of the TF OxyR (26), and to quantify the stress responses to heterologous gene expression (27). Additionally, iModulons captured the transcriptional effect of genomic alterations in adaptive laboratory evolution to naphthoquinone-based aerobic respiration (28).
In previous publications, curated iModulon dashboards were presented as static supplemental PDFs. While these are useful for disseminating basic information on specific iModulons, they suffered from several problems: (i) labeling of genes and inclusion of tables were limited by page space; (ii) the inability to interact via hovering and clicking slowed analysis; (iii) information was static and could not be updated; (iv) access to the underlying data was lim-ited and required coding experience and (v) dissemination required sharing large PDFs instead of simple URL links. Given the potential for iModulons to enhance a wide range of research, we sought to address these issues with an online knowledgebase.
Here, we present the iModulon database (iModulonDB; imodulondb.org), an interactive web tool for accessing high quality transcriptomic datasets and their curated iModulons. Users begin by selecting their organism and dataset of interest, and then may either browse the iModulons identified through curated tables or search for the genes and regulators of interest to them. Each gene and iModulon has an interactive analytics dashboard featuring hoverable, clickable, and downloadable tables and graphs. iModu-lonDB presents the relationships between genes and iModulons, the inferred activities of regulators across diverse conditions, and the concordance between our data-driven gene modules and literature-defined regulons, such as those available on RegulonDB (3). Compared to the previous PDF dashboards, the new website enables global usage of this information and provides much more depth. iModu-lonDB meets the emerging need for an online, data-driven TRN resource; it obtains co-regulated gene sets based only on transcriptomic observations, which will help a broad audience of microbiologists and systems biologists to investigate genes or iModulons of interest.

Data generation and acquisition
Escherichia coli PRECISE (21) and StaphPRECISE (22) were generated using RNA sequencing (RNA-Seq), as described in their respective publications. The Bacillus subtilis dataset (23) is a well-known microarray dataset originally published by Nicolas et al. (29), and is already featured as the expression compendium on the popular database SubtiWiki (4). Despite using older, established data, the Bacillus decomposition still provides novel insights, which demonstrates the efficacy of ICA. All raw data is available on GEO or SRA.

Quality control and preprocessing
Prior to running ICA, we ensure that transcriptomic data passes stringent quality control as described in each dataset's original publication (21)(22)(23). For RNA-Seq data, genes shorter than 100 nucleotides or with under 10 fragments per million-mapped reads are removed. We then compute transcripts per million (TPM) using DESeq2 (30). The final expression compendium is log-transformed log 2 (TPM + 1) before analysis -this is the form of the data available for download on iModulonDB. For both RNA-Seq and microarray data, biological replicates with R 2 < 0.9 between final expression values are removed to reduce technical noise. Before computing iModulons, we choose a reference condition (such as exponential growth on M9 minimal media) and subtract its expression from all other samples; this results in activity levels for each iModulon relative to a known baseline. , which is quality controlled for high biological replicate correlation. ICA is used to obtain iModulons, which have gene weights for each gene in the matrix M and activity levels for each condition in the matrix A (upper right). M weights are analogous to the strength of transcription factor binding upstream of genes and A activities are analogous to condition-dependent transcription factor activities, which may depend on processes such as ligand binding. Note that X≈M*A. The iModulons are curated by assigning functions, categories, and regulators, as shown in the example MalT iModulon dashboard. A representative screenshot of the interactive graphs is shown (lower left; larger version with labels included in Figure 2), covering genes and activities as well as the concordance between this gene set and its curated regulator. The three rows of the dashboard result in the three categories of novel insights described (lower right).

Computing robust iModulons
We perform ICA as described in the original publications (21-23) using the Scikit-learn (31) implementation of the FastICA algorithm (32). To ensure robustness (since ICA is a stochastic gradient search algorithm), we perform ICA multiple times with random seeds for each dataset and cluster their M matrices using the Scikit-learn implementation of the DBSCAN algorithm (33). We keep independent components that appear in more than 50% of the ICA runs. Code to compute robust independent components is publicly available (github.com/SBRG/precise-db). Note that previous publications used S to refer to the M matrix, but it has since been renamed to avoid confusion with the stoichiometric matrix S (34).
ICA produces a set of independent components, each of which contains a weight for each gene in the expression dataset. Most gene weights are near zero for a given independent component, so the genes with large positive or negative weights are considered to be in the iModulon. To transform an independent component into an iModulon, we iteratively remove the highest absolute weighted genes from an iModulon until the D'Agostino K 2 statistic of normality (35) of the remaining distribution falls below an organism-specific cutoff, and take all removed genes to be iModulon members. See the original publications for additional details (21)(22)(23).

A web-based analytics platform for data-driven TRNs
We developed iModulonDB (imodulondb.org) to enhance the field of microbial genetic regulation by presenting TRNs based on observed signals in transcriptomic datasets. This site provides biologists with the ability to easily navigate large datasets and quickly find gene modules through a search tool or by browsing curated annotations. The inferred activity levels of each iModulon, which are readily available on the site, provide valuable, novel insights into cellular function. We hope that iModulonDB will become an important part of the database ecosystem, providing a machine learning-derived perspective that also links to other databases for synergetic TRN characterization.  iModulonDB currently contains the three datasets listed in Table 1. It covers 204 iModulons, 180 of which are characterized. The E. coli dataset is the largest; it includes 278 expression profiles with various gene knock-outs and evolved strains. This leads to its high dimensionality and lowest explained variance. The presence of genomic alterations also results in the most genomic iModulons (36), reducing the fraction of iModulons with known regulators. The S. aureus dataset is comparatively smaller, and is explained very well by its iModulons. Nearly all iModulons could be characterized, but the dearth of existing TRN annotations for this organism makes it more difficult to align certain iModulons to regulators. For B. subtilis, both mR-NAs and non-coding RNAs were included in the decomposition. This leads to a high number of 'genes' and 17 uncharacterized (noisy) iModulons. However, 95% of characterized iModulons were associated with a known set of regulators. Overall, the decompositions successfully produced iModulons that align well with our existing knowledge of transcriptional regulation in these species.

Dataset pages show lists of iModulons, constituting a datadriven TRN
Users begin by selecting one of our decompositions from the home page. This brings them to a dataset page (Figure 2A and B). The left sidebar (Figure 2A) lists basic features of the dataset, including organism details, dimensions, and a link to the relevant publication. The page also contains a curated table of all the iModulons identified (Figure 2B). iModulons are given a name, regulator, function, and category. When multiple regulators capture the iModulon better than a single one, they may be combined using 'or' (/) or 'and' (+) set operators. For example, the ExuR/FucR iModulon contains genes regulated by either ExuR or FucR, while the GutM+SrlR iModulon contains only genes regulated by both GutM and SrlR (see the original publications (21-23) for more details). If multiple iModulons are linked to the same regulator, which indicates subsets of a large regulon, then a hyphen and numeral is added to the iModulon name (e.g. Fur-1). Some iModulons are enriched for a function or genomic alteration instead of a regulator, so they are named accordingly. Three statistics are also included in the table: (i) N, the number of genes; (ii) precision, the fraction of iModulon genes regulated by the enriched regulator(s) and (iii) recall, the fraction of regulated genes in the iModulon. Users may sort the table by any of its columns, or click on a row to learn about the iModulon.
Some common iModulon categories are: carbon source utilization, amino acid and nucleotide biosynthesis, metal homeostasis, stress response, and lifestyle (such as biofilm production). There are also iModulons categorized as 'uncharacterized', which may indicate undiscovered genetic or regulatory relationships (36).

iModulon pages present information-dense interactive analytic dashboards
After selecting an iModulon, users are taken to its iModulon page ( Figure 2C-I). A gray box on this page ( Figure 2C) lists the curated features: function, category, and regulator. If available, the regulator entry will contain links to other databases (RegulonDB in E. coli (3) or SubtiWiki in B. subtilis (4)). If a regulator is assigned, the precision and recall of the enrichment is shown.
The gene table ( Figure 2D) contains information about all gene members of the iModulon. By default, the gene table is sorted by the associated gene weight from the M matrix (although users may click any column to sort by a different feature). Most columns contain information from external databases (EcoCyc in E. coli (37), AureoWiki in S. aureus (38), and SubtiWiki in B. subtilis (4)), such as gene product descriptions, operons, and associated regulators. If the iModulon has been assigned regulator(s), then the regulator names will appear as columns in the table, with green checks if the gene is known to be associated with that regulator and red X's if not. This table is valuable for understanding the iModulon, as browsing the gene function list helps to understand the function of the iModulon as a whole. The boolean transcription factor columns are also a tool for discovery, since the genes marked with red 'X's may be controlled by the regulator despite the lack of a known association. Rows are also links to gene pages. The gene histogram ( Figure 2E) shows the distribution of all gene weights for this iModulon. The gene scatter plot ( Figure 2F) presents the weights on the Y axis and the expression of each gene in the reference condition along the X axis. For details of the features and analysis of these two plots, see the supplement (Supplementary Results).
The activity bar graph ( Figure 2G) displays the activity level of the iModulon (with respect to a reference condition) across all expression profiles. iModulon activities are a measure of relative transcription factor activity, which can be a very valuable resource that is difficult to obtain using other methods.
If the iModulon has a matched regulator, then a 'Regulation' row will appear at the bottom of the dashboard ( Figure 2H and I), which quantifies the concordance between our transcriptomics-derived groupings (iModulons) and the gene groupings (regulons) available in literature and on other databases (3,4). It contains a venn diagram comparing the set of iModulon genes against the regulon (Figure 2H), which can be hovered over to quickly see agreements and discrepancies between iModulons and regulon annotations. If the regulator is also encoded by a gene in the expression compendium, then there will be a scatter plot or series of scatter plots showing the iModulon activity and regulator gene expression across the conditions ( Figure 2I). Each point can be hovered over to see the name of the sample it represents. For some regulators, such as sigma factors and self-regulating TFs, a high correlation may be observed in this plot. When post-transcriptional modifications such as phosphorylation and ligand binding affect the iModulon activity, we see low correlations between the expression of the regulator and the activity of the genes it regulates. Regulators are not necessarily part of their own iModulons. The computation of these correlations is described in the original dataset publications (21)(22)(23). In some cases, the correlation is best captured using a broken line (as in Figure 2I) to signify that a minimum expression level of the regulator must be reached before a correlation is observed.

Gene pages connect users to iModulons of interest
Gene pages enable users to quickly find iModulons relevant to their research. Similar to the iModulon and dataset pages, the left sidebar on this page will list the gene's identifier, gene name, gene product, operon, COG category, and known regulator(s), as well as a link to a relevant database (EcoCyc in E. coli (37), Aureowiki in S. aureus (38) and SubtiWiki in B. subtilis (4)). The dashboard contains two elements: (i) a table of iModulons and (ii) an expression profile. The expression profile shows log 2 TPM expression values across each dataset.
The iModulon table (included in Figure 3B) lists the iModulons in order of weight, with the strongest iModulon associations at the top of the list. If the gene is in an iModulon, a green check will appear next to it (red X otherwise). The table also includes additional details for each iModulon. Here, a user can easily find iModulons containing genes of interest, and navigate to the iModulon page to learn more. If a gene is not in any iModulon, the gene is not a strong part of any independent signals in the dataset.

Other major features: about, search and download
As many researchers are not familiar with iModulons, we have included a thorough 'About' page with a YouTube video on ICA, an illustrated walkthrough of our pipeline, and details on how to read our dashboards. We also include an email address (imodulondb@ucsd.edu) where users can send feedback or request that we work with their dataset.
After choosing a dataset, users may click the 'Search' link in the toolbar from any page to access our search functionality. This tool allows the users to search through genes and/or iModulons. For genes, results matching queried gene names, gene IDs, and gene products will be displayed. Similarly for iModulons, matching iModulon names, associated regulators, and iModulon functions will generate search results. Results are separated into the relevant iModulon and gene sections.
Each of the main three pages (dataset, iModulon and gene) has a download menu in the toolbar. From there, all data is available for download in bulk or parts. The bulk data includes the original log-tpm data, the two ICA-generated matrices, the gene annotations and literature TRN used, the metadata on all samples, and the curated iModulon table. For individual iModulons, the gene weights and activity levels are available along with the gene table as it appears in the dashboard. Likewise, gene pages support the download of all iModulon associations to that gene, its expression levels for each sample, and the iModulon table as it appears in the dashboard. We encourage data download and custom analysis.

Case study: how to use iModulonDB to enhance research
This section includes an example of a researcher who could gain valuable information from iModulonDB, and how they might do so. Figure 3 illustrates the process.
Researcher X has chosen a gene of interest. They are studying the branched chain amino acid (BCAA) synthesis enzyme encoded by ilvE in E. coli as a potential drug target. They want to know which conditions activate ilvE expression, and what other genes are co-expressed with this gene. Researcher X can navigate to the E. coli dataset on iModulonDB, and then choose 'Search' in the toolbar and enter 'ilvE' ( Figure 3A). ilvE will appear as a gene page result, which links to the ilvE gene page ( Figure 3B). From there, they can see some basic information about the gene, a link to its EcoCyc page (37), and its expression across our compendium. They will also find the iModulon table, which shows that it is a member of the 'Leu/Ile' iModulon. As expected, the function of the iModulon is BCAA biosynthesis. Clicking the 'Leu/Ile' row in the iModulon table will take them to the corresponding iModulon page ( Figure 3C) where they can see all of the genes (including ilvE) in this independent signal of the dataset, as well as its activity across all conditions. The iModulon is annotated as being regulated by ile-tRNA, ilvY or leu-tRNA. IlvY is a transcription factor with a page on RegulonDB (3), so users can follow that link to learn more about its regulon.
Along with linking to relevant pages on other databases, iModulonDB provides Researcher X with some novel information ( Figure 3D). The first would be the particular gene grouping of this iModulon--it covers the synthesis of all  Figure 2 for details). The activity bar graph includes an example tooltip, which appears when the bar is hovered over with a mouse. In this case, the bar with the tooltip is a repressing condition labeled with LB media, which is expected to repress the function of this iModulon. (D) The insights gained are summarized in the boxes: the co-regulated genes are enumerated, and the activating or repressing conditions in our dataset can be determined. three BCAAs plus threonine as a single unit of the transcriptome, instead of three separate regulons or several operons as they may find on other databases. This grouping is observed from transcriptomic data, and likely results from a combination of genetic factors that respond to separate but metabolically related metabolites. It is useful to understand the cell as manipulating all of these genes together as a unit, which may affect how the researcher implements their drug targeting. The other major insight would be gained by probing the activity profile; for example, hovering over the tallest bars shows that this iModulon is activated under iron starvation and reactive oxygen species (ROS) stress and hovering over the most negative bars indicates deactivation under rich media conditions or under osmotic stress from NaCl. Searching the literature for explanations would reveal that ROS and iron starvation damage the iron-sulfur clusters needed for member enzymes (39) and create BCAAlimiting conditions, while the amino acids in rich media turn off these genes through their well-studied mechanisms (40). Direct or indirect repression of these genes by osmotic stress has not been studied and may be worth investigating. All of this information would help Researcher X determine whether ilvE and the function encoded by the Leu/Ile iModulon are good potential drug targets.
Not all iModulons need to be found through gene pages. Users may be more interested in a general process or specific regulator, which can also be searched for on the search page. Alternatively, users can start by selecting an iModulon from the dataset page. They may stumble upon a sur-prising gene grouping or unexpected condition/iModulon activation pair, which could lead to valuable new hypotheses worth testing. This knowledgebase would also be very valuable for someone studying a completely uncharacterized gene, as its presence in an iModulon gives clues to its function. Another way to use this knowledgebase is simply to follow along with the existing publications (21)(22)(23). For E. coli and B. subtilis, expected and unexpected activating conditions for each iModulon were listed in supplementary tables, which may serve as a starting place for hypothesis generation.

Design and implementation
iModulonDB is implemented and deployed using a simple web application stack and combination of user interface technologies. The server side is entirely hosted by GitHub Pages (pages.github.com) with HTTPS enforced. It relies on local computations performed in Python 3.7 with Jupyter notebooks (jupyter.org). The client side is implemented using a combination of HTML, CSS, and JavaScript. Bootstrap 4.5 (getbootstrap.com) is used to manage page layout and ensure mobile compatibility. Interactive tables were made using Tabulator (tabulator.info), and plots are implemented in HighCharts using a non-commercial license (highcharts.com). Other JavaScript packages used include: jQuery (jquery.com), Popper.js (popper.js.org), URLSearchParams (developer. mozilla.org), and PapaParse (papaparse.com).

DISCUSSION
iModulonDB is a data-driven bacterial TRN knowledgebase that meets the need for unsupervised, observationbased gene groupings and inferred genetic regulator activities. Its interactive graphical user interface with search and download functionality facilitates usage by scientists with computational and non-computational backgrounds alike. The platform provides a novel perspective on bacterial genomes based on big data observations, and also connects to (and largely agrees with) databases that report established gene groupings based on past literature. It currently includes data for three organisms based on three publications and contains 204 iModulons, and will be expanded in the future to accommodate new datasets and organisms. This work demonstrates the potential for iModulonDB to improve our understanding of bacterial TRNs for a wide variety of organisms.

DATA AVAILABILITY
iModulonDB is freely available online at https: //imodulondb.org and can be accessed with a JavaScriptenabled browser. The download links in the toolbars enable download of all data and facilitate custom analysis.