PathBIX—a web server for network-based pathway annotation with adaptive null models

ABSTRACT Motivation Pathway annotation is a vital tool for interpreting and giving meaning to experimental data in life sciences. Numerous tools exist for this task, where the most recent generation of pathway enrichment analysis tools, network-based methods, utilize biological networks to gain a richer source of information as a basis of the analysis than merely the gene content. Network-based methods use the network crosstalk between the query gene set and the genes in known pathways, and compare this to a null model of random expectation. Results We developed PathBIX, a novel web application for network-based pathway analysis, based on the recently published ANUBIX algorithm which has been shown to be more accurate than previous network-based methods. The PathBIX website performs pathway annotation for 21 species, and utilizes prefetched and preprocessed network data from FunCoup 5.0 networks and pathway data from three databases: KEGG, Reactome, and WikiPathways. Availability https://pathbix.sbc.su.se/ Contact erik.sonnhammer@scilifelab.se Supplementary information Supplementary data are available at Bioinformatics Advances online.


Introduction
Data generation is often not the limiting factor to gain biological insight. High-throughput experiments provide a plethora of data, which in turn yields a list of genes or proteins that are related to the studied phenotype. However, this list alone is not enough to understand the complex biological mechanisms that are involved in a given condition. An indispensable approach to resolve this issue is pathway enrichment analysis, which can reveal activated biological processes by detecting patterns of gene groups in the list. This is done by comparing to functionally characterized and curated gene sets, known as pathways, in databases such as KEGG (Kanehisa et al., 2021), Reactome (Fabregat et al., 2018), WikiPathways (Martens et al., 2021), NCIPathways (Schaefer et al., 2009) andPathway Commons (Rodchenkov et al., 2020).
The importance of pathway enrichment analysis can be appreciated from the large number of methods available (Khatri et al., 2012;Nguyen et al., 2019). Pathway enrichment methods can be divided into two main approaches: methods that use a score for every gene in the genome, and methods that only use a set of relevant genes. If gene expression data are available, the former approach is reasonable. However, experimental data often do not offer genome-wide per-gene scores, for instance from Genome-Wide Association Studies (GWAS), protein-protein interactions, curated gene-disease associations or gene-drug associations . For such cases, relevant gene set methods are more suitable since they do not require scoring every gene. Among such methods, the most widely used web server, DAVID (Huang et al., 2009), relies on the gene overlap between the gene set and the pathway. Pathway enrichment is assessed by comparing the observed overlap with the overlap that is expected by chance with a modified Fisher's exact test called the EASE score. Despite being the most widely used approach, overlap-based methods have some important limitations. They assume gene independence which is often not true in biology since genes interact between each other in complex biological systems. Moreover, they do not take into account that some genes in the gene set may have a more important role in the pathway than others. Additionally, these methods are highly dependent on the coverage of pathway knowledge, which is still incomplete, leading to false negatives (Castresana-Aguirre and Sonnhammer, 2020; Ogris et al., 2017). For instance, if half of a pathway is used as a query against the other half, it would not be possible to detect enrichment.
Sensitivity can be substantially boosted by network-based approaches that do not rely on the overlap between query set and pathway but instead measure the enrichment of network crosstalk between them. To this end, they employ a functional association network of genes such as FunCoup (Persson et al., 2021) or STRING (Szklarczyk et al., 2021). By using interactions between query gene set and pathway instead of the gene overlap, the pathway incompleteness problem is diminished. Some of these methods are implemented in web servers, for example, PathwAX (Ogris et al., 2016), EviNet (Jeggari et al., 2018) and Enrichnet (Glaab et al., 2012). However, 1 these web servers have some limitations. Enrichnet does not estimate the statistical significance of the crosstalk, and PathwAX and EviNet sometimes suffer from high false-positive rates, which can be observed when using random query gene sets (Castresana-Aguirre and Sonnhammer, 2020). The reason for false positives is that the null model of these methods assumes that pathways follow the statistical properties of random gene sets, which is not true since pathways tend to contain functionally related genes. This issue has been addressed by the novel method ANUBIX (Castresana-Aguirre and Sonnhammer, 2020), which scores random gene sets against the real pathways to build a custom null model for each pathway. By keeping pathway properties intact it can model random crosstalk much better than previous methods, thereby decreasing the false-positive rate while keeping a high true-positive rate.
To make ANUBIX more accessible to the scientific community, we developed a web server named PathBIX, featuring a graphical web interface with preprocessed data in front of an ANUBIX backend. It supports 21 species and three pathway databases: KEGG, Reactome and WikiPathways. To visualize the results, the connectivity of each query gene to each pathway is shown as a heatmap matrix, and an interactive bipartite network is provided for each query-pathway pair.

Implementation of web application
The PathBIX application is built with a frontend application containing the graphical interface, using the Angular (v.9) framework (https: //angular.io) , and a backend application containing the application logic, built as a Python (v.3.7) (Van Rossum and Drake, 2009) rest API. The Python API runs the R (v.3.6) (R Core Team, 2017) version of the ANUBIX (Castresana-Aguirre and Sonnhammer, 2020) algorithm to perform the pathway enrichment analysis. Only genes present in the FunCoup networks are used as background to sample genes for the null model. The Python API uses precalculated data stored in a Postgres (v.10) (Stonebraker and Rowe, 1986) database. The database and application are deployed and run in Podman (v.2.0.5) (https: //podman.io) containers, and the containers are run on a CentOS8 virtual machine.

Networks and species
The networks used to calculate the crosstalk between the query and the pathway gene sets were collected from FunCoup 5.0 (Persson et al., 2021), a comprehensive database of functional association networks built on redundancy-weighted naive Bayesian integration of 11 evidence types. PathBIX uses all 21 species available in the latest FunCoup release. In PathBIX, the networks are used with a cutoff on the link confidence in order to ensure high reliability on the links. The default link confidence cutoff used for the networks is 0.8, but the user may select even higher thresholds of 0.95 or 0.99. The number of links and genes included in the networks for each PathBIX species, using the default cutoff of 0.8 can be found in Supplementary Table S1.
Gene identifier mappings were extracted from the FunCoup 5.0 database and incorporated into PathBIX, allowing the user to input and label their results as either Ensembl Gene IDs, Ensembl Protein IDs, Ensembl Transcript IDs, Gene symbols, UniProt ID, NCBI gene ID or a selection of species-specific database identifiers, depending on the available identifier types for each gene. These mappings originate from idmapping files from UniProt (v.2020_01) (UniProt Consortium, 2021) and the EBI Quest for Orthologs reference proteomes (v.2020_04) (Altenhoff et al., 2020). The networks and mapping data retrieved from FunCoup will be updated in the PathBix resource when a new version of the FunCoup database is available.

Pathway databases
PathBIX uses information on pathway gene sets from three different pathway databases, KEGG (Kanehisa et al., 2021) (v.96.0), Reactome (v.75) (Fabregat et al., 2018) and WikiPathways (Martens et al., 2021) (v.20201110). These databases were chosen as they are highly cited and complementary open-source pathway databases (Mubeen et al., 2020). The sets of pathway genes, together with the pathway identifier, pathway name and pathway class were fetched from the respective pathway database website and processed so that all gene identifiers were translated to Ensembl IDs. For WikiPathways, data included in the set of 'reviewed and approved' pathways were processed and translated, and species resulting in having more than five pathways were included in the PathBIX pathway collection. Pathways having a WikiPathway category were assigned this category name as their pathway class. Reactome pathways are hierarchically distributed, with the lowest level pathways being very specific. Once mapped to FunCoup, 46% of the pathway leaf nodes have less than 10 genes. To avoid very small pathways and reduce the number of pathways, we flattened Reactome's hierarchy by merging child pathways with their parents until a pathway size distribution similar to KEGG was obtained for each species. We intend to perform a yearly update to the PathBIX database to ensure that the pathway data used in the analysis is recent and up to date.

Examples
For each species in PathBIX, at least one example gene set is provided on the website, in order to make it easy for users to see what the expected input looks like, and to test the functionality of the website. The examples were fetched from MSigDB (Subramanian et al., 2005) for Homo sapiens and Mus musculus, Gramene ontology (Jaiswal et al., 2002) for Arabidopsis thaliana, PomBase (Lock et al., 2020) for Schizosaccharomyces pombe and Gene Ontology (Carbon et al., 2009) for the remaining species. Examples were selected so that a sufficient amount of genes could be mapped and identified in the networks to give reliable and relevant results. control what is shown in the table, for example that FWER must be lower than 0.05. On the right, the colored rectangular areas form a matrix that shows the crosstalk of each query gene against each found pathway, where the intensity of the color increases with higher crosstalk. Each column in the matrix represents a query gene, and information about the gene's ID and its crosstalk to the pathway in question is shown in a mouseover box (black background). Green areas indicate genes with crosstalk to the pathway, and blue rectangles represent query genes shared with the pathway Fig. 2. Detail of the PathBIX results page. The network viewer showing the crosstalk between pathway number 5 in Figure 1, 'Breast cancer' and the query gene set CHIN_BREAST_CANCER_COPY_NUMBER_UP. Nodes marked in green represent query genes while nodes marked in blue represent pathway genes. Nodes colored green and blue are genes that overlap between the query and the pathway gene set. The network viewer is toggled on or off by clicking on any pathway in the results table

Data in PathBIX
The integration of data from the three pathway databases, KEGG, Reactome and WikiPathways into PathBIX resulted in a total of 7856 pathways in the PathBIX database. The number of pathways included in PathBIX for each of the pathway databases, for each species, can be seen in Figure 3. The number of pathways ranges between 329 pathways for H.sapiens and 74 pathways for Methanocaldococcus jannaschii in KEGG, 289 pathways for H.sapiens and 49 pathways for Plasmodium falciparum in Reactome, and 622 pathways for H.sapiens and 6 pathways for Escherichia coli in WikiPathways. For Reactome and WikiPathways, pathway information is missing for some of the species present in PathBIX, while KEGG pathways are available for all species. The number of unique pathway genes included in the PathBIX database varies between the pathway databases, and since ANUBIX is basing the analysis on the information in the FunCoup networks, only pathway genes included in the networks are utilized in the tool. To visualize how many of the unique pathway genes are represented in the FunCoup network, we calculated the fraction of genes in the networks represented as genes in a pathway for each of the pathway databases (see Fig. 4). Here, we could see that mapping the genes for H.sapiens retrieved from KEGG on the FunCoup network, covers close to 50% of the genes in the network with a cutoff of 0.8. The number of pathways and unique pathway genes included in PathBIX from each pathway database, for each species, can be found in Supplementary Table S1.

Using PathBIX
To illustrate the usability of PathBIX, we ran the analysis with the gene set CHIN_BREAST_CANCER_COPY_NUMBER_UP for H.sapiens against the KEGG database, with a network cutoff of 0.95. This gene set is provided as one of the example gene sets in the tool. The results page of PathBIX when running this analysis can be found in Figure 1. This gene set is obtained from MsigDB (Subramanian et al., 2005) and includes 28 genes that were observed to have copy number aberration gains in more than 15% of 148 primary breast cancer tumor samples (Chin et al., 2007). The gene set originating from the common regions of gains was seen in the study to include oncogenes like ERBB2 as well as tumor suppressor genes, which are common genes not only in breast cancer but in many different cancer types. The PathBIX analysis resulted in 21 pathways having an FWER lower than 0.05. Among the top 10 results, seven pathways were cancer pathways, with 'Breast cancer' being placed among the top five at an FWER of 2.35e-3, closely following pathways 'Prostate cancer', 'Pathways in cancer' and 'Gastric cancer', the top-scoring pathway with an FWER of 1.91e-4. Among the top pathways, 'Thyroid Hormone Signaling Pathway' is found. Several studies have shown that thyroid hormones can play an important role in cancer cell proliferation, and that hyperthyroidism is correlated to several cancer types, including breast cancer (Liu et al., 2019;Uzair et al., 2019). The occurrence of the 'Cell cycle' pathway among the top 10 pathways can be explained by the cell cycle activity often being aberrant in human cancers, and if dysregulated can cause the abnormal cell growth seen in different types of cancer (Otto and Sicinski, 2017).

Discussion
Here we present PathBIX, a web application for reliable crosstalkbased pathway enrichment analysis. Other websites for networkbased pathway enrichment rely on methods that are prone to find false positives (Castresana-Aguirre and Sonnhammer, 2020) because their statistical models assume that pathways can be treated like random gene sets. This applies to EviNET, which is based on NEArender and computes the crosstalk significance by only taking the degrees of the query gene set, the pathway and the network into account. PathwAX uses the BinoX algorithm which computes the crosstalk significance by randomizing the networks and thereby loses some of the properties of the pathway. The main strength of PathBIX resides in providing more reliable results, by using a robust method called ANUBIX which had a much lower false-positive rate than previous methods when benchmarked with random gene sets (Castresana-Aguirre and Sonnhammer, 2020). ANUBIX keeps each pathway's topology intact and builds its null model by sampling Fig. 3. Number of pathways included in PathBIX, from KEGG, Reactome and WikiPathways, for each species in PathBIX random gene sets of the same size as the query to compute the crosstalk significance to the pathway.
Because different pathway databases use different pathway boundary definitions (Domingo-Fernández et al., 2019), major differences can be obtained in the results from the pathway enrichment analysis depending on the database selected for the analysis. In order to take this into account, and include a wider variety of pathway definitions, PathBIX includes three different pathway databases, KEGG, Reactome and WikiPathways. For certain queries, it can happen that one database yields very few enriched pathways while another one yields dozens. While many pathways overlap considerably between databases, they have somewhat different focuses. For instance, KEGG contains many disease pathways, and Reactome many signaling pathways. This can guide the users to select the pathway database best suiting their needs, as well as give a wider range of annotations by running and comparing the results from different pathway databases.