-
PDF
- Split View
-
Views
-
Cite
Cite
Judith Bernett, Dominik Krupke, Sepideh Sadegh, Jan Baumbach, Sándor P Fekete, Tim Kacprowski, Markus List, David B Blumenthal, Robust disease module mining via enumeration of diverse prize-collecting Steiner trees, Bioinformatics, Volume 38, Issue 6, March 2022, Pages 1600–1606, https://doi.org/10.1093/bioinformatics/btab876
- Share Icon Share
Abstract
Disease module mining methods (DMMMs) extract subgraphs that constitute candidate disease mechanisms from molecular interaction networks such as protein–protein interaction (PPI) networks. Irrespective of the employed models, DMMMs typically include non-robust steps in their workflows, i.e. the computed subnetworks vary when running the DMMMs multiple times on equivalent input. This lack of robustness has a negative effect on the trustworthiness of the obtained subnetworks and is hence detrimental for the widespread adoption of DMMMs in the biomedical sciences.
To overcome this problem, we present a new DMMM called ROBUST (robust disease module mining via enumeration of diverse prize-collecting Steiner trees). In a large-scale empirical evaluation, we show that ROBUST outperforms competing methods in terms of robustness, scalability and, in most settings, functional relevance of the produced modules, measured via KEGG (Kyoto Encyclopedia of Genes and Genomes) gene set enrichment scores and overlap with DisGeNET disease genes.
A Python 3 implementation and scripts to reproduce the results reported in this article are available on GitHub: https://github.com/bionetslab/robust, https://github.com/bionetslab/robust-eval.
Supplementary data are available at Bioinformatics online.
1 Introduction
Over the last decades, high-throughput molecular profiling technologies have generated an immense amount of omics data, enabling the generation of detailed interaction networks. Motivated by the possibility to uncover the pathobiology of complex diseases, the field of network medicine has emerged to untangle these connections and to pinpoint the molecular basis of complex diseases (Barabási et al., 2011; Roy et al., 2014). This task is complicated by the fact that molecular omics data such as gene expression data are generally noisy and overdetermined. Disease-causing alterations such as mutations typically have a cascading effect on the expression of genes and proteins that form the nodes of most interaction networks with typically hundreds or thousands of differentially expressed genes. Additionally, not all of the genes triggering a certain disease might be differentially expressed in an experiment because the expression profiles are limited to a snapshot of the cell state. Therefore, the discovery of disease genes using simple statistical tests is infeasible. Consequently, disease module mining methods (DMMMs) have been developed to combine analyses of gene expression profiles with mining of prior knowledge encoded in protein–protein interaction (PPI) and other networks.
DMMMs try to identify significantly enriched subnetworks by projecting the expression data on a molecular interaction network. Since solving the underlying mathematical models to optimality is typically NP-hard (Ideker et al., 2002), heuristic algorithms are used in practice, where different weight and scoring metrics are applied to the network components. Using these algorithms, subnetworks can be identified that are significantly associated with a certain disease, even when some of the individual nodes have a negligible score. Various DMMMs have been proposed in the past years (Batra et al., 2017; Lazareva et al., 2021). They have enabled new insights into complex diseases like Type-2 diabetes (Fernández-Tajes et al., 2019; Sharma et al., 2018), pulmonary arterial hypertension (Samokhin et al., 2018), coronary heart disease (Wang and Loscalzo, 2018) and asthma (Sharma et al., 2015).
Despite these success stories, existing DMMMs are known to be subject to several limitations. For instance, Levi et al. (2021) have shown that most DMMMs do not fully exploit the information contained in the gene expression data. Lazareva et al. (2021) have demonstrated that most DMMMs mainly learn from the node degrees rather than from the biological knowledge encoded in the edges of the PPI networks.
Here, we draw attention to an additional issue which has not been addressed yet: existing DMMMs lack robustness and are subject to random bias. The reason for this is that all DMMMs we are aware of include non-robust steps in their workflows, although this aspect is often not explicitly mentioned and sometimes not immediately obvious. For instance, for some methods, changing the order of the input data leads to dramatically different results. Other methods show variations of the resulting subnetworks when run multiple times on identical input. This lack of robustness is a major limitation, because reliable output is crucial to achieve a widespread adoption of DMMMs in the biomedical research community: Biomedical scientists without a strong background in computer science or mathematics often find it difficult to trust in tools that do not reliably produce the same output and, when confronted with non-robust disease modules, are therefore often less inclined to invest time and money in downstream wet lab validation. Note that simply ordering the input in some canonical but biologically meaningless way (e.g. by sorting based on gene or protein names) does not resolve this problem but merely hides it.
A straightforward approach for robustifying any DMMM is to run the DMMM n times on shuffled input and then to return the subgraph induced by nodes contained in many of the returned modules. However, this naïve approach has the disadvantage that the runtime increases by a factor of n. Moreover, it is not guaranteed to be effective because the modules obtained for the shuffled input might not be sufficiently diverse to ensure robustness (see Section 3.2 for results showing that this is a real and not only a hypothetical problem).
To address this issue, we present a new DMMM called ROBUST (robust disease module mining via enumeration of diverse prize-collecting Steiner trees). Unlike the naïve approach, ROBUST ensures robustness by enumerating pairwise diverse rather than merely pairwise non-identical disease modules. Large-scale tests on data for 829 diseases show that, unlike all tested competitors, ROBUST achieves almost perfect robustness (best-possible robustness already at the first quartile). ROBUST is also faster than its competitors and manages to compute disease modules for up to 400 seeds in <30 s. Tests on gene expression data for amyotrophic lateral sclerosis (ALS), non-small cell lung cancer (LC), ulcerative colitis (UC), Chron’s disease (CD) and Huntington’s disease (HD) demonstrate that, in most settings, ROBUST outperforms its competitors in terms of the returned modules’ functional relevance, which we measured via KEGG (Kanehisa et al., 2016) gene set enrichment w.r.t. known disease-associated pathways and overlap with DisGeNET (Piñero et al., 2020) disease genes. Finally, a case study in multiple sclerosis (MS) shows how ROBUST can be used for hypothesis generation.
2 Materials and methods
2.1 Modeling disease modules via generalized Steiner trees
Two strategic decisions have to be made when designing a new DMMM. First, one has to decide which input should be expected. In addition to a PPI network, existing DMMMs use various types of input data such as normalized expression data (Larsen et al., 2020; Ma et al., 2011; Nacu et al., 2007), gene scores (Barel and Herwig, 2020; Reyna et al., 2018), sorted lists of genes (Breitling et al., 2004), indicator matrices of differentially expressed genes (List et al., 2016) or binary input in the form of sets of disease-associated or differentially expressed seed genes (Ding et al., 2018; Ghiassian et al., 2015; Levi et al., 2021; Sadegh et al., 2020). For ROBUST, we chose the latter option for the following reasons:
Sets of seed genes are very user-friendly input. They can be computed via standard differential gene expression analysis or be obtained from public databases such as OMIM (Online Mendelian Inheritance in Man) (Amberger et al., 2019) or DisGeNET (Piñero et al., 2020), which provide disease–gene associations obtained from genome-wide association studies (GWAS).
Levi et al. (2021) have shown that DMMMs using seed sets as input tend to outperform DMMMs operating on non-binary input data.
Nonetheless, there might be scenarios where binarization is not desirable because of the arbitrariness in selecting the cutoff and the resulting loss of information. In such settings, ROBUST is not applicable.
The second question is how the disease module mining problem should be modeled mathematically. Informally, our model can be viewed as a generalized minimum-weight Steiner tree (MWST) model (relaxation is explained below). Recall that an MWST for a weighted network and a set of seed nodes is a tree with , and minimum total weight . Steiner trees have been used for disease module mining before, e.g. by the DMMMs DOMINO (Levi et al., 2021) and MuST (Sadegh et al., 2020). Computing MWSTs is NP-hard, but approximation algorithms exist, e.g. the classical 2-approximation by Kou et al. (1981) or the currently best 1.39-approximation by Byrka et al. (2013).
From a biological point of view, using MWSTs to model the disease module mining problem is promising. Functionally related genes or proteins tend to be close to each other in the molecular interaction network, and it could be shown that pairwise shortest paths of known disease genes show a considerable left shift in their distribution compared to the random expectation (Menche et al., 2015). A reasonable hypothesis is hence that the shortest paths between these disease genes overlap with causal molecular pathways (Barabási et al., 2011). Since MWSTs can be viewed as generalizations of shortest paths to settings with more than two endpoints, a disease module constructed using MWSTs can be expected to cover a large fraction of the disease-relevant molecular pathways.
As mentioned above, we use a generalized MWST model, which means that we do not strictly enforce but allow that some seeds are left uncovered (see Section 2.3 for the formal specification of our model). This is because, in the context of disease module mining, the seeds are potentially noisy due to false positives in GWAS or differential gene expression analysis. Moreover, we are eventually interested in the subgraph induced by the node set of the tree rather than in T itself. The reason is that also edges between nodes from VT which are not contained in ET might pinpoint to causal disease mechanisms and are hence potentially of interest.
2.2 Ensuring robustness via enumeration with diversity
The main limitation ROBUST is designed to overcome is the lack of robustness of existing DMMMs. However, our generalized MWST model alone does not ensure this. For a given seed set S, the PPI network G typically contains multiple near-optimal generalized Steiner trees. If we simply returned the subgraph induced by the node set of one cheap generalized Steiner tree, the output would hence again depend on the random storage order of the input, hampering the robustness of our approach.
To address this problem, our DMMM ROBUST is designed to provide a solution for the following problem specification: Given a weighted network and a set of seed nodes , compute an induced subgraph , where contains nodes that appear in many diverse near-optimal generalized Steiner trees for S. ROBUST’s overall approach is detailed in Algorithm 1 and visualized in Figure 1. Instead of computing just one near-optimal generalized Steiner tree, we enumerate up to n of them and ensure that their node sets are pairwise diverse (see Section 2.3). We then return the subgraph induced in G by nodes contained in at least % pairwise diverse Steiner trees, where both and are hyper-parameters.

Visualization of how enumeration with diversity ensures robustness. The bounding circle represents the set of all nodes contained in any near-optimal generalized Steiner tree. The sets represent the node sets of three pairwise diverse near-optimal generalized Steiner trees. For , the set M corresponds to the intersection of the sets . The sets represent the node sets of three randomly sampled near-optimal generalized Steiner trees. Nodes contained in M are contained in almost all of the sets
Two aspects should be highlighted at this point: First, the subgraph is in general not connected and its connected components hence potentially represent disjoint or complementary disease mechanisms. To allow separate downstream analyses, our implementation therefore labels ’s connected components via node attributes in the output file. Second, note that the above specification of the problem solved by ROBUST is imprecise, as we did not formally define the qualifiers ‘diverse’, ‘near-optimal’ and ‘generalized’. Several possible formal specifications are discussed in the Supplementary Material.
Input: Graph , seeds , parameters .
Output: Robust disease module for seeds S.
1 ;
2 ;
3 return ;
2.3 Enumerating cheap and diverse generalized Steiner trees
Let us consider how a set of diverse, low-weight networks that connect most of the seed nodes (generalized Steiner trees) can be computed. Naïvely, we can compute a Steiner tree T on G to obtain our first network. To obtain a different network, we simply remove a Steiner node or edge in T from G and compute a new Steiner tree that now differs in at least one position from T. As mentioned above, the currently best-known algorithm for the MWST problem is the 1.39-approximation by Byrka et al. (2013), but even with a better algorithm, the results may not be as hoped: If the removed edge or node is in a dense part of the graph, it can easily be circumvented and the resulting solution will nearly be the same. If it is in a sparse part and an important connection, the resulting solution will be expensive. This naïve approach is employed by the DMMM MuST (Sadegh et al., 2020), which uses the 2-approximation for MWST by Kou et al. (1981), iteratively removes Steiner edges and eventually returns the union of all computed Steiner trees.
The alternative we propose is to not just focus on one seed node which is removed but instead to make all of the previously used nodes less attractive depending on how often they have been used. To achieve this, we use prize-collecting Steiner trees (PCSTs) where we assign every seed a high value and every other node a low but not negligible value. If a non-seed node is returned in a solution, we decrease its value to make it less attractive for future solutions. The seeds’ high values encourage a PCST algorithm to include them in the solution. The low decreasing values of the other nodes encourage the algorithm to integrate less often used nodes. By keeping the values below the edge costs, this approach avoids randomly integrating nodes at the cost of a more expensive network. More details on the seed values are provided in the Supplementary Material.
Computing an optimal PCST is unfortunately also NP-hard. Therefore, we use the primal–dual approximation algorithm by Goemans and Williamson (1995) and an implementation by Hegde et al. (2015). It has a guaranteed approximation factor of at most two and a runtime complexity of , where d refers to the encoding size in bits for the weight and values. In practice, this algorithm yields very natural solutions, because it is based on a linear programming relaxation which captures a lot of structure and not only the objective value (see Supplementary Material for details). The implementation is remarkably fast and solves instances with multiple hundreds of thousand edges within seconds (often even less than a second), as shown by Hegde et al. (2014).
Let pcst_apx be a PCST algorithm that receives a graph , positive edge weights , and non-negative node values and returns a tree with and , minimizing . Note that, in the context of disease module mining, edges are typically unweighted, i.e. w(e) = 1 holds for all . However, some DMMMs use edge weights that penalize edges toward high-degree nodes in the PPI network (Sadegh et al., 2020). While we have designed ROBUST with unweighted edges in mind, we here present the more general weighted version.
Our proposed algorithm is defined in Algorithm 2. As input, it expects a graph , a seed set S, edge weights w, a number of desired trees n, as well as tuning parameters and explained below. The first step is to define the initial values p to be passed to pcst_apx. To give the algorithm a high incentive to integrate all seeds, we determine their value based on the diameter of the graph and the maximum edge weight (line 1). The initial values of the non-seeds are defined as α times the minimum edge weight (line 1). Note that, in the unweighted case, both the minimum and the maximum edge weight equals 1.
Input: Graph , seeds , edge weights
, parameters .
Output: Set of diverse PCST node sets.
1 for do ;
2 for do ;
3 ;
4 while do
5 ;
6 if then break
7 ;
8 for do
9 return
2.4 Influence of the hyper-parameters
ROBUST has four hyper-parameters: the desired number of PCSTs n, the threshold τ, and the tuning parameters α and β. The effects of these parameters can be summarized as follows:
Intuitively, the desired number of trees controls the extent to which the disease module computed by ROBUST covers the space of all near-optimal generalized Steiner trees. Setting n to a rather large value is hence desirable but detrimental for the runtime.
The threshold provides a tradeoff between robustness and explorativeness. The larger τ, the more robust but less explorative the disease module computed by ROBUST.
The parameter modifies the initial values for integrating non-seeds into the tree. This implicitly represents the allowed diversion from the cheapest Steiner tree. For α = 0, the algorithm would only return the best Steiner tree it can find but not allow any diversion from it. The larger α, the more diverse and but also more expensive the returned Steiner trees are allowed to become.
The parameter modifies the decrease of the values for integrating non-seeds into the trees. Setting β = 0 will only give a value to a non-seed until its first appearance in one tree. This can quickly exhaust the available non-seeds and then has the same problem as α = 0. A too high value, on the other hand, might not be able to reduce the values sufficiently to make the other non-seeds more attractive. Hence, more trees need to be generated to achieve diversity.
3 Results and discussion
3.1 Compared methods
We compared ROBUST to the state-of-the-art DMMMs DIAMOnD (Ghiassian et al., 2015), MuST (Sadegh et al., 2020) and DOMINO (Levi et al., 2021). These methods were selected for the following reasons:
They all expect binary input (i.e. lists of differentially expressed or disease-associated seed genes) and are hence directly comparable to our method ROBUST.
DOMINO has been shown to outperform other DMMMs in two independent studies (Lazareva et al., 2021; Levi et al., 2021) and can hence be considered to be one of the best available methods.
Based on the number of citations, DIAMOnD is arguably one of the most widely used DMMMs.
MuST serves as a baseline model for extracting disease modules via Steiner trees without the improvements of ROBUST.
Moreover, we compared ROBUST to a baseline implementing the naïve approach at robustification outlined in Section 1. More precisely, instead of enumerating diverse PCSTs as detailed above, we enumerate multiple Steiner trees by simply shuffling the input data and running the classical 2-approximation algorithm by Kou et al. (1981) several times. In the sequel, this naïve baseline is called R-MuST (randomized MuST). An AIMe report (Matschinske et al., 2021) with further details on the empirical evaluation is available at https://aime-registry.org/report/VM0hhS.
3.2 Robustness
3.2.1 Protocol and data used for robustness tests
The methods were tested on a human PPI network obtained from IID (Integrated Interactions Database) (Kotlyar et al., 2019) filtered for experimentally validated interactions. The network consists of 329 215 edges between 17 666 proteins. Sets of disease-associated seed genes were constructed for 929 diseases by merging disease–gene associations from OMIM (Amberger et al., 2019) and DisGeNET (Piñero et al., 2020). For the hyper-parameter evaluation of ROBUST (Section 3.2.2), 100 of the seed sets were used and subsequently excluded for the comparison of ROBUST to its competitors (Section 3.2.3).
3.2.2 Effect of hyper-parameters
For testing ROBUST’s robustness w.r.t. the hyper-parameters, we varied and . Supplementary Figure S1 shows the full results. While increasing α significantly deteriorated the robustness, increasing β marginally improved it. Therefore, we focus on the results for and , which are shown in Figure 2. Unsurprisingly, we observe that the robustness improved when increasing the threshold τ and the number of trees n. However, for n = 30, we observe large robustness coefficients already for small values of τ. Keeping τ small is prima facie desirable as it allows ROBUST to compute more exploratory disease modules. Moreover, ROBUSTS’s runtime requirements increase only very moderately with increasing n (see Section 3.4). For these reasons, we selected the hyper-parameter setup for all further experiments. Note (i) that the 100 seed sets used to select these hyper-parameters were not used for evaluating ROBUST’s robustness in comparison to its competitors and (ii) that we tuned the hyper-parameters only for robustness and not for functional relevance.

Effect of the number of trees n and the threshold τ on the robustness of ROBUST for and
3.2.3 Robustness in comparison to competitors
Figure 3 shows the distribution of the 829 mean Jaccard indices for each of the compared DMMMs. Of all the tested DMMMs, only ROBUST yielded almost perfectly robust disease modules with a robustness coefficient of at the first quartile. ROBUST is clearly superior to its precursor MuST, to the naïve baseline implementation R-MuST, and, most importantly, also to the state-of-the-art DMMM DOMINO. DIAMOnD yielded remarkably high robustness coefficients, especially considering that its hyper-parameters were not tuned for robustness. Nonetheless, its robustness coefficients were still statistically significantly lower than the ones obtained for ROBUST (Table 1). The superiority of ROBUST to MuST and R-MuST shows that it is indeed necessary to ensure diversity when enumerating (generalized) Steiner trees. Merely enumerating pairwise different trees does not yield the desired robustness. More generally, the rather bad performance of R-MuST shows that the above-mentioned naïve approach for robustification (run DMMM n times on shuffled input and then return subgraph induced by nodes contained in many modules) is not guaranteed to be effective.

Robustness of ROBUST with hyper-parameter setup in comparison to its competitors. Like ROBUST, MuST and R-MuST were setup to run with
P-values obtained by comparing the robustness coefficients from two DMMMs via the Mann–Whitney U test (alternative hypothesis: DMMM 1 yields larger robustness coefficients than DMMM 2)
DMMM 1 . | DMMM 2 . | P-value . |
---|---|---|
ROBUST | DOMINO | |
ROBUST | MuST | |
ROBUST | R-MuST | |
ROBUST | DIAMOnD |
DMMM 1 . | DMMM 2 . | P-value . |
---|---|---|
ROBUST | DOMINO | |
ROBUST | MuST | |
ROBUST | R-MuST | |
ROBUST | DIAMOnD |
Note: Note that the P-values should be interpreted carefully, because the Mann–Whitney U test is a rank-based test and can hence yield very small P-values even if the quantitative differences between the compared populations are small.
P-values obtained by comparing the robustness coefficients from two DMMMs via the Mann–Whitney U test (alternative hypothesis: DMMM 1 yields larger robustness coefficients than DMMM 2)
DMMM 1 . | DMMM 2 . | P-value . |
---|---|---|
ROBUST | DOMINO | |
ROBUST | MuST | |
ROBUST | R-MuST | |
ROBUST | DIAMOnD |
DMMM 1 . | DMMM 2 . | P-value . |
---|---|---|
ROBUST | DOMINO | |
ROBUST | MuST | |
ROBUST | R-MuST | |
ROBUST | DIAMOnD |
Note: Note that the P-values should be interpreted carefully, because the Mann–Whitney U test is a rank-based test and can hence yield very small P-values even if the quantitative differences between the compared populations are small.
3.3 Functional relevance
3.3.1 Protocol and data used for functional relevance tests
Functional relevance tests were conducted by implementing custom wrappers for the DMMM test suite introduced by Lazareva et al. (2021). In the test suite, gene expression datasets with case/control information for five complex diseases are used, namely ALS, non-small cell LC, UC, CD and HD. The seed genes are obtained by applying a two-sided Mann–Whitney U test on the case/control expression vectors and extracting all genes with Bonferroni-adjusted P-values <0.001. Each set of seed genes was projected onto one of the five widely used PPI networks BioGRID (Oughtred et al., 2019), APID (Agile Protein Interactomes DataServer) (Alonso-Lopez et al., 2016, 2019), STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) (Szklarczyk et al., 2019) with high confidence interactions only, HPRD (Human Protein Reference Database) (Keshava Prasad et al., 2009) and IID (Kotlyar et al., 2019). Functional relevance was evaluated via gene set enrichment P-values w.r.t. KEGG pathways (Kanehisa et al., 2016) corresponding to the diseases and via overlap coefficients with the disease-associated DisGeNET (Piñero et al., 2020) gene sets. For more details, we refer to Lazareva et al. (2021).
3.3.2 Functional relevance in comparison to competitors
Figure 4 shows the distributions of the functional relevance scores for ROBUST, MuST, DIAMOnD and DOMINO run on the five disease networks. The baseline implementation R-MuST was excluded from the tests due to high runtime requirements and poor robustness results. Overall, ROBUST outperformed the other three DMMMs w.r.t. both functional relevance scores. The CD dataset is the only case where DIAMOnD clearly yielded better results than ROBUST.

Distribution of the functional relevance scores for ROBUST, MuST, DIAMOnD and DOMINO. (A) Overlap coefficients of the disease modules and the DisGeNET disease genes, split by disease. (B) KEGG gene set enrichment P-values split by disease. (C) Overlap coefficient distributions over all networks and disease seed sets. (D) KEGG gene set enrichment distributions over all networks and disease seed sets
The test suite introduced by Lazareva et al. (2021) also supports permutation tests to assess to which extent DMMMs are potentially biased toward hub nodes in the PPI networks (for details, we refer to the original publication). This was also tested for ROBUST, DIAMOnD and DOMINO, the results are shown in Supplementary Figure S2 (MuST was excluded because of its high runtime and the large number of runs required for the permutation tests). In this dimension, ROBUST performed similarly to DIAMOnD but was outperformed by DOMINO. To reduce the risk of including false positives into the solution, it might hence be advisable to run ROBUST on context-specific networks, e.g. by keeping edges only for PPIs experimentally validated in tissue relevant for the disease of interest. We followed this approach for our case study in MS (Section 3.5).
3.4 Scalability
3.4.1 Protocol and data used for scalability tests
As for the robustness tests, we used a human PPI network obtained from IID, filtered based on experimental validation. We randomly generated seed sets of sizes , ran all compared DMMMs on all of them and measured the runtimes. For ROBUST, MuST and R-MuST, we additionally varied the number of trees . For all DMMMs except MuST and R-MuST and each seed set size k, runtimes were measured on 10 random seed sets of size k. MuST and R-MuST were evaluated only on one seed set for each k, because of their high runtime requirements.
3.4. 2 Scalability in comparison to competitors
Figure 5 shows the results of the scalability tests. The most important observations are the following:

Runtime of the DMMMs versus number of seeds (all DMMMs) and number of seeds and trees (MuST, R-MuST and ROBUST). The line plots visualize the mean runtimes
MuST and R-MuST are around 2 orders of magnitude slower than DIAMOnD, DOMINO and ROBUST.
While, for ROBUST and DOMINO, the runtime increases sublinearly with the number of seeds, we observe a linear increase in runtime for DIAMOnD.
Increasing the number of trees affects ROBUST’s runtime only very moderately.
In sum, ROBUST hence exhibits the best runtime behavior even if the number of trees is set to n = 30 as suggested above: For small seed sets, ROBUST is approximately as fast as DIAMOnD and around five times faster than DOMINO. For large seed sets, it is around twice as fast as DIAMOnD and between three and four times faster than DOMINO.
3.5 Case study in MS
In addition to the quantitative evaluation reported in the previous sections, we performed a case study in MS to showcase how to use ROBUST for hypothesis generation. First, we constructed a context-specific PPI network from IID by filtering for the interactions experimentally validated in brain tissue. Then, proteins associated with MS were obtained by merging DisGeNet and OMIM annotations. This yielded 42 seeds, 26 of which were contained in the context-specific network.
Running ROBUST on these 26 seeds yielded a disease module with 90 additional proteins (Supplementary Fig. S3), including galectin-1, which was found in each of the 30 trees. It has been shown that galectin-1 plays an important regulatory role in MS patients (Starossom et al., 2012). We then took a closer look at the 2-hop neighborhood of galectin-1 within the computed diseases module (Supplementary Fig. S4). In this submodule, we found thioredoxin (TXN), peroxiredoxin-2 (PRDX2), the mitochondrial thioredoxin-dependent peroxide reductase (PRDX3) and DJ-1 (extended findings in the Supplementary Material).
TXN, PRDX2, PRDX3 and DJ-1 are antioxidant molecules related to oxidative stress, a sign of various neurological disorders including MS (Liu et al., 2020). TXN has been found to be significantly upregulated in MS patients compared to healthy controls (Mahmoudian et al., 2017). PRDX2 and PRDX3 are enzymes which reduce H2O2 and hydroperoxides using TXN as substrate (Cao et al., 2007; Kamariah et al., 2016). PRDX2 was shown to be upregulated in white matter MS lesions (Voigt et al., 2017). While DJ-1 is not directly linked to TXN, the two molecules share downstream targets and it has been suggested that there is some cross-talk between these two systems (Raninga et al., 2014) and various studies have linked DJ-1 to MS (Hirotani et al., 2008; van Horssen et al., 2010). These findings show how ROBUST can identify a submodule related to oxidative stress in MS whose participants share common pathways.
4 Conclusions, limitations and outlook
In this article, we have presented a novel DMMM called ROBUST which, unlike existing approaches, computes almost perfectly robust disease modules when run multiple times on equivalent input. ROBUST is also faster than its competitors and, in most settings, outperforms them in terms of functional relevance of the computed modules.
We conclude this article by pointing out three limitations of ROBUST which constitute challenges for future work. First, ROBUST supports only binary input and so future work is needed to overcome the robustness deficit in disease module detection with continuous input. Second, ROBUST is outperformed by DOMINO w.r.t. resistance to hub node bias and it hence remains an open algorithmic challenge to design a DMMM which is both immune to hub node bias and robust w.r.t. random storage order. Third, it would be interesting from a theoretical point of view to investigate whether the Steiner tree enumeration problem underlying ROBUST can be formalized such that it allows for approximation algorithms with provable approximation guarantees.
Funding
This work received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreements 826078 (J.B.) and 777111 (S.S., T.K., J.B.). This publication reflects only the authors’ view, and the European Commission is not responsible for any use that may be made of the information it contains. This work was supported by the German Federal Ministry of Education and Research (BMBF) under Grant 01ZX1908A (J.B., T.K., M.L.), Grant 01ZX1910D (J.B.) and Grant 031L0214A (J.B.).
Data availability
All data required to reproduce the results reported in this article are available on GitHub: https://github.com/bionetslab/robust-eval.
Conflict of Interest: none declared.
References
Author notes
The authors wish it to be known that, in their opinion, Judith Bernett and Dominik Krupke authors contributed equally.