GeneTrail 3: advanced high-throughput enrichment analysis

Abstract We present GeneTrail 3, a major extension of our web service GeneTrail that offers rich functionality for the identification, analysis, and visualization of deregulated biological processes. Our web service provides a comprehensive collection of biological processes and signaling pathways for 12 model organisms that can be analyzed with a powerful framework for enrichment and network analysis of transcriptomic, miRNomic, proteomic, and genomic data sets. Moreover, GeneTrail offers novel workflows for the analysis of epigenetic marks, time series experiments, and single cell data. We demonstrate the capabilities of our web service in two case-studies, which highlight that GeneTrail is well equipped for uncovering complex molecular mechanisms. GeneTrail is freely accessible at: http://genetrail.bioinf.uni-sb.de.


Motivation
The GeneTrail epigenomics workflow provides functionality to analyze histone marks, DNA methylation patterns, and open-chromatin regions of two or more sample groups. The epigenetic marks are used to identify genes that transition between certain chromatin states in the different groups. The gene sets of the different transition groups are then used to analyze affected signaling pathways. To this end, several analysis steps are carried out: First, our web service identifies, which epigenetic modifications are present in the promoter, enhancer and gene body regions of each transcript. This information then defines a combined chromatin state for all regulatory and coding regions that influence this transcript, i.e. active, poised, repressed, or no information. In the following sections, we call this the chromatin state of a transcript. The transcript information is then used to define the chromatin state of each gene. In the last step, The epigenetic marks are used to divide the genes into transition groups, e.g. genes that transition from a certain chromatin state in one sample to another state in a second sample. For all resulting transition groups, enrichment analyses are then carried out to identify associated pathways. An overview of the workflow is shown in Figure 1.

Input data
The input for the epigenomics workflow are genomic positions affected by a certain epigenetic mark. Currently, we support open chromatin regions, histone modifications, DNA methylation patterns, and associated gene expression measurements. In the following sections, we describe the file formats that can be used to upload different data sources.

Open chromatin regions or histone modifications
Open chromatin regions or histone modifications are expected to be given as peaks in a BED file format.

BED format
In this format every line represents a region of interest. Each individual line contains at least three fields.

DNA methylation patterns
DNA methylation can either be uploaded as IDAT file resulting from Illumina BeadArrays or as BED file from bisulfite sequencing experiments.

Gene expression patterns
RNA-Seq input has to be a tab-separated matrix of gene-or transcript expression values. The rows of this matrix are expected to be gene or transcript identifiers and the columns should represent measured samples. These samples can include replicates for each group. The matrix has to have row names and a header. All expression measurements have to be uploaded in a single matrix. After the matrix was successfully uploaded and parsed, a table will appear in which you can assign each measurement (column in the matrix) to a group.

Naming convention
After a file has been uploaded, it must be assigned to the associated epigenetic mark. In case the name of the uploaded file contains the mark this assignment is done automatically.

Upload
All files can either be uploaded individually or in form of a ZIP archive. After the files have been uploaded, they have to be assigned to the investigated sample groups.

Identifying epigenetic modifications in promoter, enhancer, and gene body regions of transcripts
We define a promoter of a transcript as window around the transcription start site (TSS). This window size is a parameter and can be 1000 base pairs (also written as 1kb), 2kb, or 5kb, resulting in a total promoter size of 2001, 4001 and 10001 bp, respectively, for each transcript. The gene body is defined as the union of the genomic positions of all exons. Enhancer regions are taken from the database GeneHancer [1]. This database comprises regions of possible enhancers and their possible gene targets for humans. The genomic positions for the TSS and exons are taken from GENCODE [2]. A promoter, enhancer, or gene body region is stated to be affected by an epigenetic mark, if at least one bp of the region is affected by the measured mark. For histone modifications and open-chromatin regions, we use bedtools [3] to test for an intersection of the genomic positions. For DNA methylation, we use RnBeads [4] to calculate a beta-value for each region that states the degree of DNA methylation in this region. We choose 0.8 as betavalue cutoff. As a result, each transcript is assigned a set of epigenetic marks for its regulatory and gene body regions.

Predicting a combined chromatin state of promoter, enhancer, and gene body regions of a transcript
An epigenetic mark can have a positive or negative effect on transcription, depending on its location. The tri-methylation of lysine 36 in histone 3 (in short H3K36me3) for example can positively affect transcription if it is found in the gene body [5,6]. Another example is DNA methylation that can have a negative effect on transcription if found in promoter regions [7]. In our analysis, the epigenetic modification pattern, affecting the regulatory and gene body regions of a transcript, is analyzed based on this a priori knowledge gathered from the specialized databases HIstome [8] and HHMD (Human Histone Modification Databse) [9], as well as from a manual literature search to complement this knowledge. Based on decision rules extrated from this knowlege base, we can assign a combined chromatin state to those regions, which is either 'active', 'poised', or 'repressed'. If no epigenetic marks are present in the promotor, enhancer, or gene body regions of a transcript, the chromatin state is set to 'no information'.

Predicting the chromatin state of a gene
If a gene codes for several transcripts, it might occur that the chromatin states for the regulatory and gene body regions of those transcripts are assigned to different values. In that case, we assume that the function of these transcripts are redundant and that any transcript can fulfill the function of the gene. Therefore, we assign a gene to the most active chromatin state of its transcripts. Hence, if at least one transcripts is active, the gene is assigned to be active. If this is not the case and if at least one transcripts is poised, the gene is assigned to be poised. If all transcripts are repressed, the gene is also assigned to be repressed. In the case that no transcript contains any epigenetic mark, the gene is assigned to have no information.

Performing enrichment analyses
In the last step of the workflow, the different uploaded sample groups are compared using enrichment analyses. For each pair of groups, we define 16 sets of genes, one for each chromatin state pair. Thereby, chromatin state pairs represent transitions between chromatin states of two groups. A gene is assigned to a chromatin state pair (e.g., active -repressed) if it is assigned to the first chromatin state in the first group (in the example active) and to the second chromatin state in the second group (in the example repressed). For these 16 sets of genes, we perform an Over-Representation Analysis (ORA).

Over-representation analysis
Let us assume that we have a biological category (signaling pathway or biological process) that has k entries in our test set, which consists of n entries, and l entries in the reference (all investigated genes), which consists of m entries. We can then use one of the following statistical tests to check if the test set has more entries in our category then expected by chance.

Hypergeometric test
For this analysis, all elements of test set are always part of the reference. For this purpose, the hypergeometric test can be applied to compute a p-value for the analyzed category: lim−ln−i mn

Multiple testing correction
Since for each test set in our analysis multiple biological categories are tested simultaneously, we need to adjust the resulting p-values in order the account for the multiple testing problem. For this purpose GeneTrail provides a variety of methods (cf. [10]).

Motivation
The GeneTrail time series workflow provides functionality for the analysis of time resolved gene, protein, or miRNA expression data sets. In order to process these data sets, several analysis steps are conducted: our web service first uses a two-stage clustering approach to group biological entities with very similar expression patterns over time.
For the resulting clusters, enrichment analyses are carried out to identify associated pathways. An overview of the workflow is shown in Figure 1.  In general we recommend to upload already normalized and logarithmized data. This gives users the complete control over quality control, batch effect removal, and normalization. However, for RNA-Seq data GeneTrail providesa variety of normalization methods: log(TMM+1) [2], log(GeTMM+1) [3], log(TPM+1) [1], log(CPM+1).

Clustering
In order to group genes, proteins or miRNA with similar expression patterns, we carry out two clustering steps. First, a strict clustering is performed that generates small groups with a high similarity between all members. The second clustering then combines similar clusters generated in the first step to "super-clusters". These super-clusters give a general overview of the data. However, they can be further refined to investigate the contained subclusters and finally also the individual genes. For each stage of our clustering approach, we need to carry out four steps: (1) filtering the gene expression data , (2) calculating the distance between all gene pairs, (2) hierachical clustering, and (3) cutting the dendrogram to obtain clusters.

Filtering
In a preprocessing step, we remove all genes that only show limited expression changes over time. To this end, we implemented two metrics that summarize the overall expression changes in a time series t = {t 1 , ..., t n }. The result of a metric is then compared to a user defined threshold δ to remove unwanted genes, i.e. genes with low overall expression changes with respect to the chosen metric.

Absolute expression difference
The first measure calculates the absolute difference between the highest and lowest expression value in t.

Average expression change
The second method calculates the average expression difference over all consecutive time points.

Distance measures
In order to quantify the distance of two time courses, we implemented a variety of distance measures. For the description of these measures, we consider both curves as n-dimensional vectors p = (p 1 , p i , ..., p n ) and q = (q 1 , q i , ..., q n ), where i ∈ {1, ..n} represents time point i.

Distance measures for time points
In this section, we describe distance measures that calculate the distance between p and q based on their entries.

Euclidean distance
The Euclidean distance is a metric for distance between two points in Euclidean space. It is defined as

Minimized Euclidean distance
For the minimized version of the Euclidean distance, we shift one of the vectors such that the distance between the two is minimized. This means that height differences between the curves are ignored.
The optimal value for s can be found as follows:

Distance measures for transitions between time points
In this section, we describe distance measures that consider the transitions between the time points rather than the time points themselves.

Angle distance
We define the angle distance as the sum of all angles between the transitions of consecutive time points in the time series.
where θ is defined as the angle between the transition of two consecutive points.

Euclidean distance for gradients
This version of the Euclidean distance, calculates the distance between p and q based on the gradients of all consecutive time points.

Association measures
In addition to the distance measures described above, we can also use a variety of association/similarity measures. Before a clustering can be applied, these values have to be transformed to distance measures.

Pearson correlation
The Pearson correlation coefficient (PCC) is a measure for linear dependence between two random variables P and Q. We assume that p and q are samples drawn from these variables.
wherep,q and s P , s Q are the sample means and samples variances respectively.
The Pearson correlation coefficient r ranges from -1 to 1. A value of 1 implies that the relationship between P and Q is perfectly described by a linear function, with all data points lying on a line for which both P and Q increase. A value of 1 implies that all data points lie on a line for which P increases as Q decreases. A value of 0 implies that there is no linear dependence between the variables. This means we can define the following distance measure: d(p, q) = 1 − r(p, q)

Spearman correlation
The Spearman correlation coefficient (SCC) is a non-parametric measure for dependence between between two random variables P and Q. Here, we assume that p and q are samples drawn from these variables. The SCC assesses how well the relationship between two variables can be described using a monotonic function.
The rank rank(x i ) of a sample x i is the position of that sample in the decreasingly ordered sequence of all samples.
In accordance with the PCC, we can transform the SCC into a distance measure:

Hierarchical clustering
For the hierarchical clustering, we rely on the 'fastcluster' R-package (https://cran.rproject.org/web/packages/fastcluster/index.html). It provides an agglomerative clustering approach, where each observation starts in its own cluster and pairs of clusters are merged until a complete hierarchy is generated. In each step, all the clusters with the smallest distance are merged. In order to identify the minimum distance of two clusters A and B, users can choose from a variety of methods.

Final clusters
The resulting dendrogram contains the complete cluster hierarchy. In order to obtain a set of clusters that is of special interest to the user, the dendrogram has to be cut. To this end, users can define a threshold t ∈ [0, 1] that describes the quantile of merges that should be used to generate a final result. A smaller threshold will result in a stricter clustering with more individual clusters that have a higher similarity between all members.

Over-representation analysis
For the resulting clusters and super-clusters, enrichment analyses are carried out to identify associated pathways. To this end, lets assume that we have a biological category (signaling pathway or biological process) that has k entries in our test set (cluster), which consists of n entries, and l entries in the reference (all investigated genes), which consists of m entries. We can then use one of the following statistical tests to check if the test set has more entries in our category then expected by chance.

Hypergeometric test
For this analysis, all elements of test set is always part of the reference. For this purpose, the hypergeometric test can be applied to compute a p-value for the analyzed category:

Multiple testing correction
Since for each cluster in our analysis multiple biological categories are tested simultaneously, we need to adjust the resulting p-values in order the account for the multiple testing problem. For this purpose GeneTrail provides a variety of methods (cf. [4]).

Motivation
The GeneTrail single cell workflow provides functionality to analyze single cell RNA sequencing (scRNA-seq) data sets. Particularly, it is designed to (1) identify for each cell active biological processes, and, (2) subsequently, based on these results, to characterize functional differences between cells. In order to process scRNA-seq data sets, several analysis steps are conducted: First, we optionally normalize the user uploaded RNAseq matrix. Then, we conduct an enrichment analysis for each single-cell to identify pathways associated with the expressed genes. Additionally, we cluster the cells and perform dimension reduction. As a last step, we group individual cells based on cell clusters or uploaded annotations, e.g. tissue or sample id, and find pathways that are associated with specific groups. An overview of the workflow is shown in Figure 1.

Input data
The input for our single cell workflow is a gene expression matrix obtained from a scRNA-Seq run and metadata file. Both can be uploaded as plain text files.

Expression matrix
The expression matrix is assumed to be a tab-separated matrix where columns represent specific cells and each row represents the expression measurements for a particular gene in those cells. The content of the matrix can either be read/UMI counts or normalized expression values.

Metadata
The metadata file can be uploaded as tab-separated text file in which each column provides additional meta information for the cells that should be analyzed. The cell identifier have to match with the column names of the scRNA-Seq matrix. Only cells with an entry in both, the metadata file, and the scRNA-Seq matrix, are analyzed in the subsequent workflow.

Filtering and normalization
As already mentioned, users are able to either upload preprocessed and normalized expression values or raw count data. For raw counts, GeneTrail provides several processing steps that are discussed in the following sections.

Filtering
Current scRNA-seq protocols involve steps to isolate single cells. During this process several artifacts can occur. Some of the cells might not have been separated properly, which could result in doublets, while others could potentially be damaged. GeneTrail provides several filtering procedures that help to remove these artifacts. To this end, we follow the best practices guide by Luecken and Theis [1].

Removing damaged cells
A low number of UMI/read counts, a low number of expressed genes and a high percentage of mitochondrial gene counts might be an indication for cells with a broken membrane, where the mRNA in the cytoplasm might have leaked out. Hence, users can select thresholds for each of those criterions to remove affected cells.

Removing doublets
Doublets can potentially be detected by a very high number of UMI/read counts or by a very high number of expressed genes. Here, users are again able to select thresholds in order to exclude affected cells from further analyses.

Normalization
For our analysis to work properly, we require a within-sample normalization, as we compare the expression values of different genes. This means that depending on the used scRNA-Seq protocol (e.g. for full-length sequencing protocols like SMART-seq2) a gene length normalization is required. Additionally, normalized expression values should be log transformed (log(x+1)). Currently, we offer two methods for gene-length normalization: TPM [4] and GeTMM [6], and two methods for between-sample normalization: CPM [?] and TMM [5]. Based on the selected protocol, we already preselect suitable default parameters. After normalization all expression values are log transformed (log 2 (x + 1)).

Enrichment analysis of individual cells
Next, for each single cell, an enrichment analysis is conducted. To this end, we consider only the most expressed genes of the cell. Therefore, the user can determine a threshold for filtering the genes that should be further analyzed. Alternatively, we use the x genes with the highest normalized expression values. In both cases, we generate one gene set for each single cell, independently. The gene sets are used as test sets in an Over-Representation Analysis (ORA) [2,3]. As a reference set all protein coding genes are used.

Over-representation analysis
Let us assume that we have a biological category (signaling pathway or biological process) that has k entries in our test set, which consists of n entries, and l entries in the reference, which consists of m entries. We can then use one of the following statistical tests to check if the test set has more entries in our category then expected by chance.

Hypergeometric test
If all elements of the test set are also a part of the reference, the hypergeometric test is applied to compute a p-value for the analyzed category:

Fisher's exact test
If the test set contains elements that are not a part of the reference, the Fisher's exact test is applied to compute a p-value for the analyzed category:

Multiple testing correction
Since for each cell in our single cell analysis multiple biological categories are tested simultaneously, we need to adjust the resulting p-values in order the account for the multiple testing problem. For this purpose GeneTrail provides a variety of methods (cf. [3]).  The variable e represents cells in which pathway p is enriched. The variable i represents cells that are contained in group g.

Group comparison of enrichment results
Along with the scRNA-seq expression matrix, a user can upload metadata for each cell (e.g. tissue of origin, sample id, or clinical information). This information can be used to define groups of special interest to the user. Additionally, we cluster the cells in our workflow independently from the given annotations using Seurat3 [7,8]. Subsequently, the clusters can also be used to partition the cells into groups.
In order to find pathways that are associated with certain groups, we perform a χsquared test on the previously generated enrichment results. The intuition behind the χ-squared test is to identify pathways that are more often enriched in cells from one particular group compared to all other groups. To this end, we perform one χ-squared test for each pathway and for each group.
The χ-squared test for two variables can be defined over a 2 × 2 contingency table. Let p be a pathway and g be a group of interest. Furthermore, let n be the total number of cells. We create a 2 × 2 contingency table as shown in Table 1. The columns indicate if pathway p is enriched for a cell and the rows indicate if a cell is contained in group g. The cells in the contingency table (c ij ) count the number of single cells with a combination of these attributes, e.g. the top-left cell of the contingency table (c 00 ) counts the number of single cells in group g for which pathway p is enriched. The χ-squared test then tests if being in group g is statistically independent from having a significant result for pathway p. The χ-squared test statistics is defined as with p ij as the estimated probability of c ij calculated based on the global distribution. We obtained a p-value for this χ-squared statistic using Boost version 1.71.

Marker gene detection
For each group, we compare the gene expression of this group against all other groups in order to find genes that are associated with a specific group. The marker genes for a specific group are calculated as follows: For each group, the expression matrix is separated into cells belonging to the given group and cells that do not belong to the group. For these two new groups of cells, a Wilcoxon rank-sum test, and an independentshrinkage t-test is performed using GeneTrail with default parameters.

Dimension reduction
In order to visualize the cells, we offer several dimension reductions including UMAP and t-SNE. The t-SNE dimension reduction is calcuated with Seurat3 [7,8] using the following parameters: In the normalization step of Seurat3, we use "LogNormalize" as normalization method, and 10000 as scaling factor. To reduce the dimensionality of the data set, only the most variable genes are kept as features. To find the most variable genes, we select "vst" as selection method. The number of most variable genes is a parameter of the GeneTrail single cell workflow and defaults to 2000. Additionally, we choose 50 as perplexity. The UMAP dimension reduction is also calculated with Seurat3 using the same parameters and 30 for the n.neighbours parameter.
Finally, we provide another UMAP reduction representation calculated with Monocle3 [9], which is also the basis for the pseudo-time analysis described in the following section. For the Monocle3 analysis, we first reduced the dimension of the gene expression data set to the first 50 principal components and afterwards performed a dimension reduction to the first two UMAP coordinates using standard parameters of Monocle3.

Pseudo-time analysis
The pseudo-time analysis offered by GeneTrail is performed using the Monocle3 R package with standard parameters and a dimension reduction as discussed above. The basis for the pseudo-time calculation is a graph learning algorithm that tries to learn the ancestry of the cells. Monocle3 provides the coordinates of the graph vertices in their UMAP reduction space. Therefore, it is currently not possible to change the dimension reduction representation to any other than the UMAP representation of Monocle3 for the pseudo-time plot on our results page. If the graph is generated, the pseudo-time is calculated based on the distance to a selected start cell. In our automatized workflow, we currently do not support the manual selection of a start cell, but we estimate the start cell using a script from the Monocle3 documentation.

Enrichment analysis of user-selected groups
On the results page, we offer an on-demand analysis to compare user-selected groups against each other by performing an enrichment analysis. To this end, a user can decide which groups (i.e. cells with the same annotation) should be part of either the sample or the reference set of the enrichment analysis. The annotation can originate from the metadata file or from the clusters that GeneTrail automatically calculated. If more than one group of cells is chosen for the sample (reference) set, the union of the cells in these groups is taken as sample (reference) set. To ensure mutual exclusiveness of the sample and reference set, the cells that are in both sets are removed.
After the assignment of annotation groups to either the reference or the sample set and an appropriate curation of the reference and the sample set as described above, we perform an enrichment analysis to compare the sample to the reference set. We compare the expression of the sample and the reference set as follows: For each gene i, we calculate the mean expression value in the sample set m si and the mean expression value in the reference set m ri . A score for each gene is calculated as s i = m si − m ri . Hence, if the expression values for this gene are on average higher (lower) in the sample set than in the reference set, the gene is assigned a positive (negative) score. On the resulting list of scores, we perform two ORAs, one for the highest 1000 positive genes (representing the sample group) and one for the lowest 1000 negative genes (representing the reference group). The ORA is performed using the GeneTrail workflow with default parameters (see Section 4.1 of this supplement for further information on ORA in GeneTrail).
-Supplement S6 -GeneTrail 3: Parameters for case studies 1 Time series analysis of activated T-cells

Data set (GSE136625)
The data set contains gene expression microarrays of CD4+ T cells from the blood of two human donors. The extracted T cells were in vitro activated and expression profiles were created at 2h intervals from 0h-24h. For both donors and each time point 3 replicates were created.

Microarrays and normalization
Gene expression profiles were measured using Agilent-039494 SurePrint G3 Human GE v2 8x60K Microarrays (Cat. no. G4851B, Agilent Technologies, Santa Clara, CA, USA) according to the manufacturer's instructions. Raw expression values were extracted using the Agilent Feature Extraction Software. The limma package was then used for background correction (method="normexp", offset=16) and normalization between the arrays (method=quantile).

Preprocessing
For our analysis, we only considered the gene expression profiles of Donor 1. For each time point, we used the median value to aggregate all replicates. Here, we analyze a single cell data set of mouse microglia cells from different brain tissues (cerebellum, cortex, hippocampus and striatum). This dataset is part of a comprehensive single cell transcriptome atlas that was designed to study hallmarks of aging in a large variety of mouse tissues and organs (GSE132042, Tabula Muris Senis Project [1]). The subset of microglia cells contains 8330 gene expression profiles of cells from mice with distinct age groups: 3 month, 18 month and 24 month.