MultiNEP: a multi-omics network enhancement framework for prioritizing disease genes and metabolites simultaneously

Abstract Motivation Many studies have successfully used network information to prioritize candidate omics profiles associated with diseases. The metabolome, as the link between genotypes and phenotypes, has accumulated growing attention. Using a ”multi-omics” network constructed with a gene–gene network, a metabolite–metabolite network, and a gene–metabolite network to simultaneously prioritize candidate disease-associated metabolites and gene expressions could further utilize gene–metabolite interactions that are not used when prioritizing them separately. However, the number of metabolites is usually 100 times fewer than that of genes. Without accounting for this imbalance issue, we cannot effectively use gene–metabolite interactions when simultaneously prioritizing disease-associated metabolites and genes. Results Here, we developed a Multi-omics Network Enhancement Prioritization (MultiNEP) framework with a weighting scheme to reweight contributions of different sub-networks in a multi-omics network to effectively prioritize candidate disease-associated metabolites and genes simultaneously. In simulation studies, MultiNEP outperforms competing methods that do not address network imbalances and identifies more true signal genes and metabolites simultaneously when we down-weight relative contributions of the gene–gene network and up-weight that of the metabolite–metabolite network to the gene–metabolite network. Applications to two human cancer cohorts show that MultiNEP prioritizes more cancer-related genes by effectively using both within- and between-omics interactions after handling network imbalance. Availability and implementation The developed MultiNEP framework is implemented in an R package and available at: https://github.com/Karenxzr/MultiNep


S1.2 Choice of denoising thresholds for S E sub-networks
In step2 of the MultiNEP framework, the enhanced disease-specific network S t,sym needs to be further denoised. Because there are 3.3% and 3.5% edges in prostate and breast cancer S 0 g ; 23.0% and 17.7% edges in prostate and breast cancer S 0 m ; and 2.3% and 1.6% edges in prostate and breast cancer S 0 gm , we chose 5% and 30% as denoising thresholds for S t,sym g and S t,sym m , respectively. Since g-m interactions are much less studied, we kept more g-m interactions (15%) in S t,sym gm (Tables S2, S3).

S1.3 Data Processing
DF/HCC Prostate Cancer Cohort For metabolites, we followed the preprocessing pipeline as in Penney et al. [2021] implemented in the R package maplet Chetnik et al. [2021]. Briefly, we started with 222 metabolites detected from three different batches of experiments. We excluded 8 metabolites with more than 50% of missing, leaving us with 214 metabolites. After removing batch effects by correcting data to the run day median and performing quotient normalizations Dieterle et al. [2006], we imputed missing metabolites using K-nearest neighbor method Do et al. [2018]. We also excluded 11 metabolites that are not in the general network S 0 and ended up with 203 metabolites for analysis. For gene expressions, we used the robust multichip average (RMA) algorithm Irizarry [2003] from oligo R package Carvalho and Irizarry [2010] to normalize gene expressions and transformed to log2 scale, and only kept the probe with the largest variance for those with the same gene symbol annotation. We removed batch effects using 'removeBatchEffect' function from the R package limma Ritchie et al. [2015]. We ended with 29,054 genes where 18,009 of them are in the general network S 0 . Note that, MultiNEP does not require gene expression and metabolite data to be normalized.

GSE37751 Breast Cancer Cohort
There are 350 identified and pre-processed metabolites, among which 224 can be mapped in the general metabolite network. We downloaded the processed gene expression data, and transformed into log2 scale. We used the probes with maximum expression variance across samples to represent gene expressions if the gene has multiple probes. There are 23,199 genes with expression profiles, among which 17,202 genes are also in the general network. S1 S1.4 Application using GSE37751 Breast Cancer Cohort A Multi-omics general network A general multi-omics network S 0 was obtained from STRING and STITCH databases. We kept only genes and metabolites that are also in the breast cancer disease omics profiles. Thus, the general network S 0 has 17,202 genes, 224 metabolites, with 10,368,398 g-g interactions, 8,862 m-m interactions and 61,589 g-m interactions (Table S3).
Disease multi-omics profiles The omics profiles of GSE37751 Breast Cancer Cohort include DNA methylation, gene expression, and metabolome of fresh-frozen human breast tumors [Terunuma et al., 2013]. We only included 60 tumor and 47 normal-adjacent breast tissue samples (with 45 matched tumor and normal-adjacent pairs) with both gene expressions and metabolite abundances. Disease-specific similarity matrix E was constructed using 17,202 gene expressions and 224 metabolites of all 107 samples (60 tumor samples and 47 normal-adjacent tissue samples). Disease association scores v were generated using paired t statistics of 45 matched tumor and normal-adjacent pairs.
Signal prioritization We set λ g = 0.05, and λ m = 20 when applying MultiNEP to prioritize candidate breast cancer-related genes and metabolites. The 490 breast cancer (C0678222) related genes from DisGeNET were used as gold standards. We evaluated model performance by comparing the numbers of breast cancer-related genes prioritized by MultiNEP and competing methods within top ranked 1 to 500 candidate genes.
As shown in the top right panel of Figure 4, the general network, DiSNEP and MultiNEP prioritized similar numbers of breast cancer-related genes within top ranked 1 to 500 genes. We investigated reasons for the similar performances in section S1.5. Similar as in prostate cancer cohort, we investigated top ranked 200 genes identified by MultiNEP and DiSNEP. MultiNEP and DiSNEP prioritized 59 and 58 breast cancer-related genes. Of those 54 genes overlapping, 5 were uniquely identified by MultiNEP, and DiSNEP identified 4. We similarly calculated IRS for the 54 overlapping genes, 5 genes uniquely identified by MultiNEP and 4 genes uniquely identified by DiSNEP as in the application using the DF/HCC prostate cancer cohort, and observed similar patterns of results. Specifically, the 54 overlapping genes already have high IRS in S 0 g and S 0 gm , and can be identified by both DiSNEP and MultiNEP using either g-g interactions or g-m interactions. These genes have even higher IRS in S g but low IRS of 0.62 in S 0 gm . Without adjusting for relative contributions of g-g interactions to g-m interactions, DiSNEP can use their strong g-g interactions and overlook their weak g-m interactions to prioritize them with high ranks. Instead, MultiNEP lowered the relative contribution of their strong g-g interactions to their weak g-m interactions, and thus cannot identify them. The 5 genes uniquely identified by MultiNEP have high IRS of 3.27 in S 0 g and higher IRS of 9.36 in S 0 gm , so addressing more on their stronger g-m interactions relative to their g-g interactions can result in higher ranks. Above results again confirmed the ability of MultiNEP to handle network imbalance and boost signal prioritization performance.
Sensitivity analysis using a different general network S 0 For sensitivity analysis, we obtained the general multi-omics network from PathwayCommons [Cerami et al., 2010], which has 17,609 genes, 108 metabolites, 924,601 g-g interactions, 174 m-m interactions, and 5,216 g-m interactions (Table S3). We set λ g = 0.005 and λ m = 200 for MultiNEP when using PathwayCommons S 0 . Similarly, MultiNEP consistently outperforms competing methods and identified more breast cancer-related genes.
We investigated why there is a bigger improvement when using PathwayCommons S 0 . To do so, we examined the 16 prostate cancer-related genes and 20 breast cancer-related genes identified only by MultiNEP (PC), but not by the other five methods (General Net (PC), DiSNEP (PC), General Net (SS), DiSNEP (SS), and MultiNEP (SS)) among top 200 prioritized candidate genes based on DisGeNET. Using the 16 prostate cancer-related gene as an example, as shown in Table S5, these 16 genes have weak IRS in STRING & STITCH S 0 g (IRS=2.22) and S 0 gm (IRS=3.07). The g-g and g-m interactions are not strong enough to help prioritize these genes using either DiSNEP (SS) or MultiNEP (SS). On the contrary, these 16 genes have stronger IRS in PathwayCommons S 0 g (IRS=3.88) and extremely strong IRS in S 0 gm (IRS=24.52). These strong g-m interactions can only be efficiently used by MultiNEP (PC) after giving more weights on g-m interactions. Similar patterns can be observed in the 20 breast cancer-related gene uniquely identified by MultiNEP (PC).

S1.6 MultiNEP performance improvements with different omics profiles
As observed in Figure 4, with PathwayCommons, MultiNEP(SS) outperforms DiSNEP(SS) and General Net(SS) using either prostate cancer or breast cancer data. With STRING&STITCH, all three have similar performance especially when using breast cancer data ( Figure 4). We thus investigated rankings of IRS scores of genes that are also in STRING&STITCH or PathwayCommons, which we separated these genes into prostate/breast cancer-related genes based on the DisGeNET database and other noise genes. We considered rankings of these genes in terms of IRS. Recall that IRS are interaction ratio scores that measure if individual features (genes/metabolites) have stronger/weaker interactions relative to the average. That is, we investigated (1) differences in rankings of cancer-related genes in GeneralNET S 0 and that in DiSNEP S (1,1) Eg , which could reflect how much information disease omics profiles help in enhancing the general network, and could be used to understand differences in performance between GeneralNET and DiSNEP, and (2) differences in rankings of cancer-related genes in DiSNEP S ), which could reflect how much information the reweighting steps bring, and could be used to explain differences in performance between DiSNEP and MultiNEP.
For item (1), the mean differences in rankings of cancer-related genes GeneralNet -DiSNEP are: • for STRING&STITCH and prostate cancer: 3.84 (±1215), this suggests that these cancer-related genes rank higher in terms of IRS in DiSNEP after enhancement using omics data; • for STRING&STITCH and breast cancer: -93.4 (±818), this suggests that these cancer-related genes rank lower in term of IRS in DiSNEP after enhancement using omics data; • for PathwayCommons and prostate cancer: 37.5 (±1251); • for PathwayCommons and breast cancer: -12.1(±988).
These results explain the bigger improvements of DiSNEP over that of GeneralNET in prostate cancer data than that in breast cancer data. When comparing STRING&STITCH and PathwayCommons for prostate cancer data, the change in ranking is bigger on average for PathwayCommons (mean=37.5) than that for STRING&STITCH (mean=3.84), which explains the bigger improvements using PathwayCommons.
These results explain the bigger improvements of MultiNEP over that of DiSNEP in prostate cancer data than that in breast cancer data. When comparing STRING&STITCH and PathwayCommons for prostate cancer data, the change in ranking is much bigger on average for PathwayCommons (mean=182) than that for STRING&STITCH (mean=36.8) Overall, these results explain bigger improvements using PathwayCommons than using STRING&STITCH in general, and bigger improvements for prostate cancer data than that for breast cancer data in general. That also explains that the performance of three methods is similar in Figure 4 top right, that is, for breast cancer data using STRING&STITCH.

S1.7 Statistics of computational times
We use the DF/HCC prostate cancer data as an illustrative example to describe the computational times of MultiNEP and competing methods. S 0 from STRING and STITCH databases with 18,009 genes and 203 metabolites was used.  0.29% * : percentage of edges with non-zero weights out of all possible edges of a network with # nodes † : edges with non-zero weights: median (25 th percentile, 75 th percentile). 3.78% Gene-Metabolite 19,051 7,023 0.34% * : percentage of edges with non-zero weights out of all possible edges of a network with # nodes † : edges with non-zero weights: median (25 th percentile, 75 th percentile).

4.06
Figure S1: Simulation results comparing extended DiSNEP when using a multi-omics network to the original DiSNEP when using a single-omics network. Dashed lines are average numbers of identified true gene signals out of top ranked 1 to 500 prioritized candidate genes (the left panel), and average numbers of identified true metabolite signals out of top ranked 1 to 50 prioritized candidate metabolites (the right panel) using the original DiSNEP with a single-omics network. Solid lines are that when using the extended DiSNEP with a multi-omics network out of top ranked 1-500 genes, and out of top ranked 1-50 metabolites. All numbers are averaged over 100 simulations, with correlations between signal genes and signal metabolites set at ρ = 0.35. Figure S2: Simulation results comparing performance of MultiNEP S 0 that only reweights S 0 to that of MultiNEP that reweights both S 0 and E. Displayed are average numbers of identified true signal genes, signal metabolites, and both (genes and metabolites) out of top ranked 1 to 500 combined features across 100 simulations when correlations between signal genes and signal metabolites were set at ρ = 0.05, 0.2, 0.35, 0.5. We set λ g = 0.05, λ m = 20 for both MultiNEP S 0 and MultiNEP.