HONMF: integration analysis of multi-omics microbiome data via matrix factorization and hypergraph

Abstract Motivation The accumulation of multi-omics microbiome data provides an unprecedented opportunity to understand the diversity of bacterial, fungal, and viral components from different conditions. The changes in the composition of viruses, bacteria, and fungi communities have been associated with environments and critical illness. However, identifying and dissecting the heterogeneity of microbial samples and cross-kingdom interactions remains challenging. Results We propose HONMF for the integrative analysis of multi-modal microbiome data, including bacterial, fungal, and viral composition profiles. HONMF enables identification of microbial samples and data visualization, and also facilitates downstream analysis, including feature selection and cross-kingdom association analysis between species. HONMF is an unsupervised method based on hypergraph induced orthogonal non-negative matrix factorization, where it assumes that latent variables are specific for each composition profile and integrates the distinct sets of latent variables through graph fusion strategy, which better tackles the distinct characteristics in bacterial, fungal, and viral microbiome. We implemented HONMF on several multi-omics microbiome datasets from different environments and tissues. The experimental results demonstrate the superior performance of HONMF in data visualization and clustering. HONMF also provides rich biological insights by implementing discriminative microbial feature selection and bacterium–fungus–virus association analysis, which improves our understanding of ecological interactions and microbial pathogenesis. Availability and implementation The software and datasets are available at https://github.com/chonghua-1983/HONMF.


Introduction
With the rapid development of high-throughput sequencing techniques, more and more microbiome data have been accumulated (Janda and Abbott 2007, Wayne Litaker et al. 2007, De Vries et al. 2012, Callahan et al. 2016). The bacterial, fungal, and viral microbiome can be simultaneously profiled by using different sequencing methods, such as 16S rRNA (Janda and Abbott 2007), ITS1 rRNA (Callahan et al. 2016), and VIDISCA-NGS (De Vries et al. 2012). The multi-omics microbiome datasets generated by these technologies provide an unprecedented opportunity to understand the diversity of bacterial, fungal, and viral components (Honda and Littman 2012, Belkaid and Hand 2014). Previous studies reported that some critical illnesses were closely linked to changes in the composition of viruses, bacteria, and fungi (Legoff et al. 2017, Zuo et al. 2019). Dissecting the differences in microbial components between samples is important for understanding pathogenic mechanisms. Computational approaches that utilize only a single bacterial or viral composition profile cannot comprehensively reveal the roles that microbes play in shaping microbial ecology. Increasing evidence also shows that complicated relationships (high-order interactions) exist among the bacterial microbiome, viral microbiome, and host (Pfeiffer and Virgin 2016, Shkoporov and Hill 2019). These findings provide important clues that cross-kingdom interactions potentially induce diseases, but a knowledge gap remains on the manner and strength of interactions among bacteria, fungi, and viruses. Exploring the interactions between bacteria and viruses, and between fungi and viruses, is key to understanding their latent roles in the development of inflammatory bowel disease, cancer, and sepsis (Sokol et al. 2017, Sovran et al. 2018, Haak et al. 2021).
Hence, there is an essential need for integrative analysis frameworks that can systematically identify latent microbial patterns and associations across different conditions (Richard and Sokol 2019).
Recently, unsupervised learning methods have been developed to integrate multi-omics data from the same samples, including the SNF framework and its variants (Wang et al. 2014, Zhang et al. 2017, Liu and Shang 2018), but these methods were not initially designed for microbiome data analysis. The integration methods for multi-omics microbiome data include WSNF (Mac Aogáin et al. 2021) and MOFA (Argelaguet et al. 2018, Haak et al. 2021). WSNF assumes that a consensus sample similarity network exists and is shared across distinct composition profiles. MOFA assumes that the latent variables, i.e. the low-dimensional representation of the samples, are the same for the bacterial, fungal, and viral composition profiles. However, these assumptions may be restrictive for multi-omics microbiome data, because these compositional profiles have different characteristics. MOFA takes the three microbiome abundance matrices (bacteria, fungi, and viruses) as input, and learns a low-dimensional representation of the samples and three feature-by-factor loading matrices (one per kingdom). Compared with MOFA, one drawback of SNF and WSNF is that their similarity matrices represent only the similarity between samples and cannot provide direct biological insight into the microbial features.
In this manuscript, we propose HONMF to systematically integrate multi-omics microbiome data, where bacterial, fungal, and virus compositional profiles are obtained from the same samples. HONMF is a versatile tool that enables clustering of the samples and data visualization, and it facilitates downstream biological analysis, including feature selection and cross-kingdom association analysis, such as bacterium-virus and fungus-virus interactions. HONMF is a novel unsupervised learning framework, named hypergraph induced orthogonal non-negative matrix factorization (NMF). Unlike SNF and WSNF, which assume that a consensus similarity network is shared across different modalities, HONMF assumes that latent variables are specific to each modality and integrates the three sets of latent variables through a graph fusion strategy, which better handles the distinct characteristics of bacterial, fungal, and virus compositional profiles. In addition, HONMF preserves the high-order geometrical structures in the original data through a hypergraph, which is an important way to reveal complex relationships among more than two species. Analyzing three multi-omics microbiome datasets from different tissues (gut and sputum) and an environment (soil), we show that HONMF is effective in identifying sample types: HONMF achieves superior performance in clustering and data visualization. The learned sample-sample similarity matrix is biologically meaningful: it absorbs the complementary information from each data modality and encodes high-order interaction information from distinct composition profiles; the similarity matrix can be used to identify discriminative bacteria, fungi, or viruses in different sample clusters. HONMF can also perform bacterium-fungus-virus association analysis based on these discriminative microbial features, which improves our understanding of ecological interactions and microbial pathogenesis. An overview of HONMF is shown in Fig. 1.

Datasets and data preprocessing
The first dataset used in this manuscript was taken from the literature (Haak et al. 2021) and downloaded from the GitHub repository (https://github.com/bwhaak/MOFA_microbiome). Faecal samples from 33 patients admitted to the Intensive Care Unit (ICU) and 13 healthy individuals were collected. Of these patients, 24 were admitted with sepsis while 9 patients had a non-septic ICU diagnosis. Both the bacterial 16S rRNA and the fungal ITS rDNA genes were profiled in parallel from the same gut samples. In addition, the virus composition profile was simultaneously obtained using VIDISCA-NGS.
The second dataset, downloaded from NCBI SRA (PRJNA59025), included 166 patients with stable bronchiectasis. A sputum sample from each participant was collected and simultaneously sequenced (Mac Aogáin et al. 2021). The bacterial, fungal, and virus composition profiles were obtained after extracting sputum DNA and RNA using a standard pipeline previously described (Coughlan et al. 2012).
The third dataset, downloaded from Wagg et al. (2019), covers soil ecosystems. Both bacterial and fungal profiles were sequenced using 16S and ITS for 48 samples. Because the virus compositional profile was not provided in the original publication, we modified the proposed model to handle this two-modal microbiome dataset.
The summary statistics of these three datasets are presented in Supplementary Table S4.

Overview of NMF
Given a data matrix $X \in \mathbb{R}_+^{p \times n}$, traditional non-negative matrix factorization (NMF) aims to find two low-rank matrices $W \in \mathbb{R}_+^{p \times k}$ and $H \in \mathbb{R}_+^{k \times n}$ to approximate $X$, where p is the number of features, n is the number of samples and k is the number of factors (Lee and Seung 1999). The objective function of NMF is as follows.
$$\min_{W \ge 0,\, H \ge 0} \; \| X - WH \|_F^2 \qquad (1)$$
where W is the basis matrix and H is the coefficient matrix, both non-negative. $\| \cdot \|_F$ denotes the Frobenius norm of a matrix. The matrix H is the low-dimensional representation of the observations.
Besides classic NMF, tri-factor symmetric NMF (tri-sNMF) has also been used for data representation and clustering. Its objective function is defined as follows.
$$\min_{H \ge 0,\, G \ge 0} \; \| A - H G H^T \|_F^2 \quad \text{s.t.} \quad H^T H = I \qquad (2)$$
where A is a symmetric sample-sample similarity matrix, $H^T$ denotes the transpose of matrix H, G is a symmetric matrix and I is the identity matrix of suitable size. Compared with NMF, tri-sNMF has two advantages: H is closer to the form of a cluster indicator, and G provides a good indicator of clustering quality (Ding et al. 2005, Ma et al. 2020). The clusters are well separated when the diagonal elements of G are much larger than the off-diagonal elements (Ding et al. 2005).
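To make the factorization concrete, the following minimal sketch fits a plain NMF using the widely used multiplicative update rules for the Frobenius loss. It is an illustration only: the function name `nmf`, the iteration count, the constant `eps`, and the toy matrix are assumptions made for this example and are not taken from the HONMF software.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative X (p x n) by W (p x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    W = rng.random((p, k)) + eps   # random non-negative initialisation
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H, "fro") ** 2)
    return W, H, errors

# Toy non-negative matrix of rank 3 (hypothetical data, 6 features x 8 samples).
rng = np.random.default_rng(1)
X = rng.random((6, 3)) @ rng.random((3, 8))
W, H, errors = nmf(X, k=3)
print(f"squared error: first iteration {errors[0]:.4f}, last {errors[-1]:.6f}")
```

Each column of H is then the k-dimensional representation of one sample, which is the quantity used for clustering.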

HONMF model
To dissect the heterogeneity of microbial samples at the level of bacterial, fungal, and virus composition profiles, we introduce the hypergraph induced orthogonal non-negative matrix factorization model (HONMF). Given the bacterial composition profile matrix $X^{(1)} \in \mathbb{R}_+^{p \times n}$ (p bacterial species in n samples), the fungal profile matrix $X^{(2)} \in \mathbb{R}_+^{q \times n}$ (q fungal species in n samples) obtained from the same samples, and the virus profile matrix $X^{(3)} \in \mathbb{R}_+^{r \times n}$ (r viral pathogens in n samples), HONMF aims to learn a consensus sample-sample similarity matrix $S \in \mathbb{R}_+^{n \times n}$, which integrates multiple molecular modalities obtained from the same microbial samples. The objective function of HONMF is as follows:
$$\min_{H^{(i)},\, G,\, S \ge 0} \; \sum_{i=1}^{3} \| A^{(i)} - H^{(i)} G H^{(i)T} \|_F^2 + \alpha \sum_{i=1}^{3} \| S - H^{(i)} H^{(i)T} \|_F^2 + \eta \sum_{i=1}^{3} \| H^{(i)T} H^{(i)} - I \|_F^2 + \beta \| S\mathbf{1} - \mathbf{1} \|_F^2 + \gamma \sum_{i=1}^{3} \mathrm{tr}\left( H^{(i)T} L_{hyg}^{(i)} H^{(i)} \right) \qquad (3)$$
Here, $H^{(i)} \in \mathbb{R}_+^{n \times k}$ represents the matrix of low-dimensional representations (i.e. latent variables) of the samples for the ith data modality (bacterial, fungal or virus composition profile). $G \in \mathbb{R}_+^{k \times k}$ represents the connections among different clusters and is symmetric. S is the learned sample-sample similarity matrix that can be used for clustering and data visualization of the microbial samples. $\mathbf{1}$ is a column vector with all elements equal to 1. $L_{hyg}^{(i)}$ represents the hypergraph Laplacian for the ith data modality, and it captures high-order relationships in the original microbiome data (Zhou et al. 2006, Gaudelet et al. 2018, Jin et al. 2019). $A^{(i)}$ is the similarity matrix obtained from the ith microbial composition profile matrix $X^{(i)}$. $\eta$ is the parameter that reflects the strength of the orthogonal constraint imposed on the columns of $H^{(i)}$, and is set to $\eta = 10$ for all datasets. $\beta$ controls the strength of the constraint $S\mathbf{1} = \mathbf{1}$ and is set to $\beta = 1$ for all datasets. $\alpha$ controls the strength of the regularization that pulls each kernel $H^{(i)} H^{(i)T}$ from each composition profile toward a consensus graph S. Larger values of $\alpha$ bring $H^{(i)} H^{(i)T}$ closer to S. $\gamma$ is the graph regularization parameter. A later section discusses how to choose $\alpha$ and $\gamma$.
In the objective function of HONMF [Equation (3)], the first term, $\sum_i \| A^{(i)} - H^{(i)} G H^{(i)T} \|_F^2$, is the standard tri-factor symmetric NMF loss function for the bacterial, fungal and virus composition profile data. The sample similarity matrix $A^{(i)}$ can be obtained with a Gaussian kernel function. The details of constructing $A^{(i)}$ are presented in Additional file 1. The second term, $\alpha \sum_i \| S - H^{(i)} H^{(i)T} \|_F^2$, is a consensus graph fusion operation that integrates the different composition profiles to learn a sample-sample similarity matrix S. One of its advantages is that it regularizes each kernel $H^{(i)} H^{(i)T}$ toward a consensus graph S. The third term, $\eta \sum_i \| H^{(i)T} H^{(i)} - I \|_F^2$, encourages the low-dimensional representations $H^{(i)}$ to be column-orthogonal, and is used to preserve the uniqueness of the solution. The fourth term, $\beta \| S\mathbf{1} - \mathbf{1} \|_F^2$, is a normalization term on S that encourages each row of S to sum to approximately 1.
Figure 1. An illustrative example of HONMF. HONMF is designed for analyzing multi-omics microbiome data where bacterial, fungal and virus composition profiles are simultaneously obtained. (a) Bacterial, fungal and virus abundance matrices $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$ (each row represents a feature, e.g. bacterium, fungus or virus, and each column represents a sample). (b) For each composition profile, the sample-sample similarity matrix $A^{(i)}$ is first computed via a kernel function. Simultaneously, a hypergraph is constructed from each composition profile matrix. The sample similarity matrices and hypergraphs are then used as inputs of HONMF. (c) HONMF learns the low-dimensional representation matrices of the samples (i.e. the latent variables) $H^{(i)}$, and the sample-sample similarity matrix S that summarizes the information in $H^{(1)}$, $H^{(2)}$ and $H^{(3)}$. (d) The sample-sample similarity matrix S facilitates downstream analysis, including data visualization and clustering. In addition, S can also be used for discriminative microbial feature selection and bacterium-fungus-virus association analysis.
Through iterative fusion, the first four terms in Equation (3) can learn the low-dimensional representation for each data modality; however, high-order interaction relationships involving more than two species may be lost. Several studies reported that high-order microbial interactions are prevalent, dominate the functional landscape of microbial communities, and enable bacteria to deal with new complex environments (Sanchez-Gorostiaga et al. 2019, Ludington 2022). However, graph-based learning methods, such as the graph Laplacian, only consider pairwise interactions (Cai et al. 2010, Ma et al. 2020). A hypergraph can effectively solve this problem. Unlike a classic graph, in which two vertices are linked by an edge, a hypergraph treats a group of vertices as a hyperedge (Zhou et al. 2006). Modeling these high-order interactions with a hypergraph can significantly enhance clustering performance. Therefore, we include a fifth term, $\gamma \sum_i \mathrm{tr}\left( H^{(i)T} L_{hyg}^{(i)} H^{(i)} \right)$, the hypergraph regularization term weighted by $\gamma$.
We note that MOFA (Argelaguet et al. 2018, 2020) also integrates multi-modal microbiome data. Our proposed HONMF model differs from MOFA in the following aspects: (i) MOFA assumes that H is shared by the bacterial, fungal and virus composition profile matrices, which may not be appropriate due to batch effects. In HONMF we relax this assumption by allowing H to vary for each composition profile and use S, derived from the graph fusion operation, to integrate cross-modality information; (ii) HONMF adopts tri-factor NMF and encourages H to be column-orthogonal, and MOFA does not. Tri-factor NMF is more flexible in data analysis tasks. The orthogonal constraint on the low-dimensional representations of microbial samples leads to better clustering solutions and interpretability: the columns of $H^{(i)}$ tend to be sparse; (iii) HONMF includes an additional term with the hypergraph Laplacian to explore complicated high-order microbial interactions, and simultaneously enhances the representation ability of the low-dimensional factor matrices. The detailed optimization algorithm for HONMF is presented in Supplementary File S1.
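The five terms described in this section can be written down as a short function that evaluates the objective for given factor matrices. This is a hedged sketch, not the authors' implementation: the exact form of each term is inferred from the verbal descriptions above, the symbols `alpha`, `beta`, `gamma` and `eta` stand for the consensus, normalization, hypergraph and orthogonality weights respectively, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def honmf_objective(A, H, G, S, L, alpha, beta, gamma, eta):
    """Evaluate a five-term HONMF-style objective (assumed form).

    A, H, L are lists with one entry per modality:
    A[i] (n x n) similarity, H[i] (n x k) latent factors, L[i] (n x n) hypergraph Laplacian.
    G is (k x k); S is the (n x n) consensus similarity matrix.
    """
    n, k = H[0].shape
    ones = np.ones(n)
    total = 0.0
    for A_i, H_i, L_i in zip(A, H, L):
        total += np.linalg.norm(A_i - H_i @ G @ H_i.T, "fro") ** 2           # term 1: tri-sNMF loss
        total += alpha * np.linalg.norm(S - H_i @ H_i.T, "fro") ** 2         # term 2: consensus fusion
        total += eta * np.linalg.norm(H_i.T @ H_i - np.eye(k), "fro") ** 2   # term 3: orthogonality
        total += gamma * np.trace(H_i.T @ L_i @ H_i)                         # term 5: hypergraph penalty
    total += beta * np.linalg.norm(S @ ones - ones) ** 2                     # term 4: rows of S sum to 1
    return total

# Toy check: a perfect two-cluster solution on 4 samples gives an objective of zero.
a = 1.0 / np.sqrt(2.0)
H0 = np.array([[a, 0.0], [a, 0.0], [0.0, a], [0.0, a]])
K = H0 @ H0.T
value = honmf_objective([K] * 3, [H0] * 3, np.eye(2), K, [np.zeros((4, 4))] * 3,
                        alpha=1.0, beta=1.0, gamma=1.0, eta=10.0)
print(value)
```

The values `eta=10.0` and `beta=1.0` follow the settings stated in the text; the values of `alpha` and `gamma` here are placeholders.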

Construction of hypergraph
In a simple graph, an edge connects only two vertices and the edge weight indicates the relationship between these two vertices. However, in many real-world tasks, representing a set of complex relational objects as a simple graph may cause information loss. For example, to group a number of articles into topics, one can construct a simple graph where two articles are connected by an edge if they share at least one author, and then apply spectral clustering (Zhou et al. 2006). This graph representation misses the information that the same author may have written three or more articles, which is useful for clustering articles into topics.
A natural way to handle the information loss in a simple graph is to represent the data relationships as a hypergraph (Fig. 2). In a hypergraph, an edge can connect more than two vertices. Let V denote the set of vertices and E denote the set of hyperedges. Each hyperedge e is a subset of V, with $\bigcup_{e \in E} e = V$, and its weight is denoted as $w(e)$. A weighted hypergraph is represented as $G = (V, E, W)$, where W is a diagonal matrix of hyperedge weights. The weight construction rule for each hyperedge e is as follows.
The weight of hyperedge e is set with a Gaussian kernel function. Specifically, we first compute the similarity between any two nodes belonging to e with the Gaussian kernel function. Then, the total similarity is taken as the weight of hyperedge e. Here, the bandwidth parameter is set to the mean of the squared Euclidean distances between pairs of nodes in the hyperedge e. The details are presented in Supplementary File S1.
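The weighting rule can be sketched as follows. The exact kernel used by the authors is given only in their supplementary file, so this example assumes the form $\exp(-d^2 / \sigma)$ with $\sigma$ equal to the mean squared distance within the hyperedge, and assumes that the "total similarity" is the sum over unordered node pairs; the function name `hyperedge_weight` is hypothetical.

```python
import numpy as np

def hyperedge_weight(X_e):
    """Weight of one hyperedge from its member samples.

    X_e: (m x d) array with one row per node (sample) in the hyperedge.
    Assumed kernel: exp(-squared_distance / sigma), summed over unordered pairs.
    """
    diff = X_e[:, None, :] - X_e[None, :, :]
    sq = (diff ** 2).sum(axis=-1)            # pairwise squared Euclidean distances
    iu = np.triu_indices(len(X_e), k=1)      # unordered pairs u < v
    if iu[0].size == 0:
        return 0.0                           # a single-node hyperedge has no pairs
    sigma = sq[iu].mean()                    # bandwidth: mean squared distance
    if sigma == 0:
        return float(iu[0].size)             # identical nodes: every pair has similarity 1
    return float(np.exp(-sq[iu] / sigma).sum())

# Two nodes at distance 5: sigma = 25, so the weight is exp(-1).
w = hyperedge_weight(np.array([[0.0, 0.0], [3.0, 4.0]]))
print(w)
```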
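The incidence matrix $P \in \mathbb{R}^{|V| \times |E|}$ of G with entries $p(v, e)$ is defined as follows.
$$p(v, e) = \begin{cases} 1, & \text{if } v \in e \\ 0, & \text{otherwise} \end{cases}$$
where $|V|$ is the number of vertices in the hypergraph and $|E|$ is the number of hyperedges. For a vertex $v \in V$, its degree is defined as $d(v) = \sum_{e \in E} w(e)\, p(v, e)$; for a hyperedge $e \in E$, its degree is $\delta(e) = \sum_{v \in V} p(v, e)$. Let $D_v$ and $D_e$ denote the diagonal degree matrices whose elements are the vertex degrees and hyperedge degrees, respectively. Then the Laplacian matrix of a hypergraph, $L_{hyg}$, can be defined as follows.
$$L_{hyg} = D_v - P W D_e^{-1} P^T$$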
In contrast to k-nearest neighbors (KNN), in this manuscript we use the Louvain community detection algorithm (Blondel et al. 2008) to construct hyperedges: each cluster is represented as a hyperedge. This strategy avoids the influence of outliers and leads to better interpretability.
Because the hypergraph captures the high-order interactions, the regularization term, i.e. the fifth term of Equation (3), can be derived as:
$$\frac{1}{2} \sum_{e \in E} \sum_{u, v \in e} \frac{w(e)}{\delta(e)} \| h_u - h_v \|^2 = \mathrm{tr}\left( H^T L_{hyg} H \right)$$
where H is the low-dimensional representation of microbial samples, $h_u$ is the row of H for sample u, and $\delta(e)$ is the degree of hyperedge e.
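The construction in this section can be checked numerically. The sketch below builds the incidence matrix and a hypergraph Laplacian from a list of hyperedges and verifies that the trace penalty equals a weighted sum of pairwise differences within each hyperedge. It assumes the unnormalized Laplacian $D_v - P W D_e^{-1} P^T$, which is one common choice in hypergraph-regularized NMF; the hyperedges, weights, and function names are hypothetical, and in practice the hyperedges would come from Louvain clusters.

```python
import numpy as np

def hypergraph_laplacian(hyperedges, n_vertices, weights=None):
    """Return incidence matrix P (|V| x |E|) and Laplacian L = D_v - P W D_e^{-1} P^T."""
    P = np.zeros((n_vertices, len(hyperedges)))
    for j, e in enumerate(hyperedges):
        P[list(e), j] = 1.0                   # p(v, e) = 1 if vertex v belongs to hyperedge e
    w = np.ones(len(hyperedges)) if weights is None else np.asarray(weights, dtype=float)
    d_v = P @ w                               # vertex degrees: sum of weights of incident hyperedges
    d_e = P.sum(axis=0)                       # hyperedge degrees: number of member vertices
    L = np.diag(d_v) - P @ np.diag(w / d_e) @ P.T
    return P, L

def pairwise_penalty(hyperedges, weights, H):
    """0.5 * sum_e sum_{u,v in e} w(e)/delta(e) * ||h_u - h_v||^2."""
    total = 0.0
    for e, w in zip(hyperedges, weights):
        for u in e:
            for v in e:
                total += 0.5 * w / len(e) * np.sum((H[u] - H[v]) ** 2)
    return total

# Toy hypergraph on 4 samples; vertex 2 belongs to both hyperedges.
edges, w = [[0, 1, 2], [2, 3]], [1.0, 2.0]
P, L = hypergraph_laplacian(edges, n_vertices=4, weights=w)
H = np.random.default_rng(0).random((4, 2))
lhs = np.trace(H.T @ L @ H)
rhs = pairwise_penalty(edges, w, H)
print(lhs, rhs)
```

Because the penalty is small only when samples in the same hyperedge have similar rows in H, it pulls the latent representations of each group of samples together as a whole, not pair by pair.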

Selection of parameters $\alpha$ and $\gamma$
In HONMF, $\alpha$ and $\gamma$ are the graph regularization parameters, and they are determined as follows. First, we solve the optimization problems

Evaluation metrics
Normalized mutual information (NMI) (Strehl and Ghosh 2002, Vinh et al. 2010), adjusted rand index (ARI) (Santos and Embrechts 2009), and silhouette coefficient (Kaufman and Rousseeuw 2009) are used to evaluate the performance of the clustering methods. Let G denote the ground-truth labels of microbial samples provided in the original publications, and P denote the predicted clustering assignments. NMI is computed by normalizing $MI(G, P)$, the mutual information between the two label sets G and P, by $H(G)$ and $H(P)$, the information entropies of G and P, respectively. High NMI values indicate good clustering consistency.
Assume that N is the number of microbial samples in a given dataset, $N_i$ is the number of samples of the ith sample type in partition G, $N_j$ is the number of samples in the jth cluster in partition P, and $N_{ij}$ is the number of samples of the ith label assigned to the jth cluster in partition P. ARI is defined as:
$$\mathrm{ARI} = \frac{\sum_{ij} \binom{N_{ij}}{2} - \left[ \sum_i \binom{N_i}{2} \sum_j \binom{N_j}{2} \right] \Big/ \binom{N}{2}}{\frac{1}{2} \left[ \sum_i \binom{N_i}{2} + \sum_j \binom{N_j}{2} \right] - \left[ \sum_i \binom{N_i}{2} \sum_j \binom{N_j}{2} \right] \Big/ \binom{N}{2}}$$
For unlabeled datasets, we use an unsupervised metric, the silhouette coefficient (Rousseeuw 1987, Xu et al. 2017), to evaluate the clustering performance. Let $a(i)$ denote the average distance of microbial sample i to all other samples within the same cluster, and $b(i)$ denote the average distance of i to all samples in the neighboring cluster, i.e. the other cluster with the smallest average distance to i. The silhouette coefficient for microbial sample i is defined as:
$$s(i) = \frac{b(i) - a(i)}{\max\{ a(i),\, b(i) \}}$$
A larger silhouette coefficient indicates that the sample is close to other samples in the same cluster and distant from samples in other clusters. The average of the silhouette coefficients over all microbial samples is used as the final evaluation metric.
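The two metrics with standard closed forms can be computed with the Python standard library alone. The sketch below uses the usual pair-counting definition of ARI and the usual definition of the silhouette coefficient on a precomputed distance matrix; the function names and toy labels are hypothetical, and scoring singleton clusters as 0 is a common convention assumed here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI from the contingency counts N_ij, N_i, N_j of two label lists."""
    n = len(truth)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_i = sum(comb(c, 2) for c in Counter(truth).values())
    sum_j = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_i * sum_j / comb(n, 2)
    max_index = 0.5 * (sum_i + sum_j)
    if max_index == expected:          # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def silhouette_scores(dist, labels):
    """Per-sample silhouette s(i) = (b - a) / max(a, b) from a distance matrix."""
    clusters = list(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, li in enumerate(labels):
        def mean_dist(c):
            members = [j for j, lj in enumerate(labels) if lj == c and j != i]
            return sum(dist[i][j] for j in members) / len(members) if members else None
        a = mean_dist(li)
        if a is None:                  # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        b = min(mean_dist(c) for c in clusters if c != li)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# Toy example: four points on a line forming two tight groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]
scores = silhouette_scores(dist, labels)
print(adjusted_rand_index(labels, [1, 1, 0, 0]), sum(scores) / len(scores))
```

ARI is invariant to relabeling of clusters, which is why swapping the two label names in the example still counts as perfect agreement.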

Identifying discriminative bacteria, fungi, or viruses with Laplacian score
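Given S obtained from HONMF, its corresponding degree matrix D with diagonal entries $D_{jj} = \sum_{i} S_{ij}$, and the Laplacian matrix $L = D - S$, the Laplacian score (LS) of a feature f is computed as follows.
$$\tilde{f} = f - \frac{f^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1}, \qquad LS(f) = \frac{\tilde{f}^T L \tilde{f}}{\tilde{f}^T D \tilde{f}}$$
Here, $\mathbf{1}$ is a column vector with all elements equal to 1. The top k features with the smallest LS values are selected and used in the downstream analysis.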
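A minimal sketch of this feature-ranking step is given below, assuming the standard Laplacian score definition (a degree-weighted centering of each feature followed by a ratio of quadratic forms). The function name, the toy similarity matrix, and the layout of the feature matrix (samples in rows, features in columns, i.e. the transpose of the abundance matrices used in the text) are assumptions for illustration.

```python
import numpy as np

def laplacian_score(F, S):
    """Laplacian score of each column of F (n_samples x n_features) given similarity S (n x n)."""
    d = S.sum(axis=1)                        # degrees (diagonal of D)
    L = np.diag(d) - S                       # graph Laplacian L = D - S
    scores = np.empty(F.shape[1])
    for r in range(F.shape[1]):
        f = F[:, r]
        f_t = f - (f @ d) / d.sum()          # remove the D-weighted mean
        denom = f_t @ (d * f_t)              # f_t^T D f_t
        scores[r] = (f_t @ L @ f_t) / denom if denom > 0 else np.inf
    return scores

# Toy similarity with two clean clusters: samples {0, 1} and {2, 3}.
S = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
F = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])                   # column 0 follows the clusters, column 1 does not
scores = laplacian_score(F, S)
top = np.argsort(scores)[:1]                 # smallest score = most discriminative
print(scores, top)
```

A feature that is constant within each cluster of S receives a score near zero, so ranking by ascending score favors features that respect the learned sample structure.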

HONMF achieves good clustering performance on different datasets from different tissues and environments
We evaluated the proposed HONMF on three multi-modal microbiome datasets. These datasets include a gut microbiome dataset and a sputum dataset from patients with stable bronchiectasis, where the bacterial, fungal, and virus composition profiles were sequenced for the same sample, and a soil microbiome dataset, where the bacterial and fungal compositions were profiled for the same sample from grassland ecosystems.
Figure 2. (a) A hyperedge can connect more than two vertices. (b) The incidence matrix corresponding to the hypergraph. The entry $(v_i, e_j)$ is set to 1 when $v_i$ belongs to $e_j$, and 0 otherwise.
We compared HONMF with several recently published methods for multi-modal microbiome data integration, including MOFA+ (Argelaguet et al. 2020), SNF (Wang et al. 2014), and WSNF (Mac Aogáin et al. 2021). For SNF and WSNF, we used their default clustering method and set all parameters to their defaults. MOFA+ gives the low-dimensional representations of the samples, but provides no clustering method. To facilitate direct comparison, we used SNN (shared nearest neighbor graph) + Louvain clustering (Blondel et al. 2008) to evaluate its performance, where the SNN graph was constructed from the low-dimensional matrix Z obtained from MOFA+. For HONMF, we first constructed a k-nearest neighbor (KNN) graph with $k = n/2$ (n is the number of samples) using the sample-sample similarity matrix, and then applied Louvain clustering to the KNN graph.
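The graph-construction step for HONMF can be sketched as follows. Only the KNN graph is shown; Louvain clustering would then be run on the resulting adjacency matrix with a community detection library. The function name, the symmetrization by logical OR, and the toy similarity matrix are assumptions made for this example.

```python
import numpy as np

def knn_graph(S, k):
    """Symmetric boolean KNN adjacency built from a similarity matrix S (n x n)."""
    n = S.shape[0]
    sim = S.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)               # a sample is not its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :k]        # indices of the k most similar samples per row
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), idx.ravel()] = True
    return A | A.T                               # keep an edge if either endpoint selects it

# Toy similarity matrix for 4 samples (hypothetical values).
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
A = knn_graph(S, k=S.shape[0] // 2)              # k = n / 2, as in the text
print(A.astype(int))
```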
The clustering performance evaluated by ARI, NMI, and silhouette scores is presented in Fig. 3. For Dataset 1, ARI and NMI were computed based on the ground-truth microbial sample labels provided in the original publication. Because the silhouette score is an unsupervised metric that does not require true labels, we also evaluated the clustering performance in terms of silhouette score.
As shown in Fig. 3, the proposed HONMF method performs well on the three datasets in terms of ARI, NMI, and average silhouette score. SNF also performs well on the gut microbiome dataset in terms of ARI, but not as well in terms of NMI. The silhouette score was computed based on the sample similarity matrix and clustering results given by each method, and it does not require "ground truth" labels. For the silhouette criterion, HONMF achieves the best performance compared with the other methods. For the sputum dataset, MOFA+ performs well in terms of the average silhouette score. The numeric values of the clustering performance are presented in Supplementary Table S1.

Ablation study
We implemented several simplified versions of HONMF [Equation (3)], where we set $\alpha$, $\gamma$ and $\eta$ equal to 0 in turn. HONMF achieves consistently good performance on most of the datasets (Additional file 1: Supplementary Table S2). We also tested a simple variant of model (3), GONMF, in which we replace the hypergraph Laplacian with a simple graph Laplacian. The performance of GONMF was not as good as that of HONMF, which suggests that it is beneficial to integrate high-order interaction information into model (3) with a hypergraph (Additional file 1: Supplementary Table S3). More implementation details are presented in Additional file 1.

HONMF facilitates microbiome data visualization
We next performed UMAP visualization (McInnes et al. 2018) based on learned sample-sample similarity matrices and low-dimensional factor matrix. The visualization results are presented in Fig. 4. For gut data, we used the microbiome sample labels provided in its original publication for fair comparison. For sputum and soil data, we used the labels identified by each method to assess visualization results. For all three datasets and for UMAP visualization, HONMF performs well among these four methods.
As shown in Fig. 4, HONMF can also identify the less abundant sample subpopulations. The healthy individuals that received oral broad-spectrum antibiotics are clearly distinguished by HONMF. The healthy individuals (green) that did not receive antibiotics are well separated from the individuals treated with antibiotics (Fig. 4a). For the sputum and soil datasets, HONMF achieves consistently good performance (Fig. 4b and c).

Testing clustering significance with SigClust
To test the significance of the clustering results, the SigClust tool was applied to these multi-omics microbiome datasets (Liu et al. 2008). Figure 5 shows that any two clusters obtained from HONMF are statistically significantly different (gut microbiome data). The statistical significance of clustering on the other datasets is presented in Supplementary Fig. S1.
As shown in Fig. 5, the clustering obtained from HONMF is statistically significant. P-values between any two clusters are small or close to zero. The analysis of clustering significance on the sputum dataset gives similar results. For the clustering performance evaluated by silhouette score, HONMF is either the best or the second best among all the methods.

Identifying bacterium-fungus-virus associations with LS
Fungi and viruses often interact directly and indirectly with bacteria in human disease. Knowledge of these associations improves our understanding of ecological interactions and microbial pathogenesis. In this subsection, feature selection is first conducted to identify discriminative bacteria, fungi, or viruses in different groups. Then, bacterium-fungus-virus association analysis is implemented based on these features. For multi-omics microbiome data, some microbes play important roles in the process of disease development. To identify these biologically meaningful microbial features, we used LS to implement feature selection.
Figure 3. The microbiome sample labels provided in the original publication were used as ground-truth labels. (c) Evaluation of the clustering results in terms of the average silhouette score. The silhouette score quantifies how well a sample is matched to its identified cluster compared to its neighboring cluster. Silhouette scores are computed based on the sample similarity matrices and labels identified by each method. Finally, the average silhouette score of all samples was reported.
Next, we implement bacterium-fungus-virus association analysis with the sample-sample similarity matrix S obtained from HONMF and the features selected above. For microbial feature vectors a and b, their correlation is computed as $\mathrm{corr}(\tilde{a}, \tilde{b})$, where $\tilde{a} = a * S$ and $\tilde{b} = b * S$. Large $\mathrm{corr}(\tilde{a}, \tilde{b})$ values indicate that these two features may have a latent association. Figure 6 shows the discriminative bacterium, fungus, and virus features, and the associations between them.
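The association step can be illustrated with a short sketch. It assumes that `*` denotes the product of a feature's abundance vector (length n) with the n x n matrix S, and that the correlation is the Pearson correlation; neither detail is confirmed by the text, and the function name and toy data are hypothetical.

```python
import numpy as np

def similarity_weighted_correlation(a, b, S):
    """Pearson correlation between a @ S and b @ S (assumed form of the association score)."""
    a_t = a @ S                                  # each entry becomes a similarity-weighted sum over samples
    b_t = b @ S
    return float(np.corrcoef(a_t, b_t)[0, 1])

# Toy abundances of two features across 4 samples, with a two-cluster similarity matrix.
S = np.array([[0.6, 0.4, 0.0, 0.0],
              [0.4, 0.6, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])
a = np.array([5.0, 4.0, 1.0, 0.0])
b = np.array([0.0, 1.0, 4.0, 6.0])
r = similarity_weighted_correlation(a, b, S)
print(r)
```

Multiplying by S averages each feature over similar samples before correlating, so the score reflects agreement at the level of sample groups and is less sensitive to noise in individual samples.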
The top 10 bacterial taxa identified by LS, including Blautia, Agathobacter, Enterococcus, Roseburia, Lachnospira, and Faecalibacterium, are key features in distinguishing health from sepsis (Fig. 6a). These signatures were driven by antibiotic perturbation. Interestingly, some facultative aerobic bacterial pathobionts, such as Enterococcus, have been previously associated with sepsis (Alverdy and Krezalek 2017), and the bacterial taxon Lachnospira has been reported to be a biomarker of a healthy microbiota and is related to colonization resistance against bacterial pathobionts (Lee et al. 2017). The top five fungal taxa identified by LS, such as Aspergillus, Saccharomyces, Paraphaeosphaeria, and Piskurozyma, are related to sepsis (Fig. 6b). Piskurozyma was previously found to be absent in critically ill patients and present in healthy subjects (Haak et al. 2021). For viral taxa, the most important signatures are also identified by the microbiome sample similarity matrix obtained from HONMF, such as Enterococcus, Escherichia, Enterobacteriaceae, Bacteroides, Streptococcus, and Lactococcus (Fig. 6c). These findings are consistent with previous research (Haak et al. 2021).
Beyond discriminative microbial features, the sample similarity matrix S facilitates bacterium-fungus-virus association analysis, which provides rich insights into microbial pathogenesis. Blautia and Roseburia, members of the Lachnospiraceae family, show negative correlations with Saccharomyces cerevisiae (Supplementary Fig. S4). These findings are supported by previous studies (Nguyen et al. 2011, García et al. 2017). For bacterium-virus associations, we found that Enterobacteriaceae has strong negative correlations with the bacterial taxa Agathobacter, Roseburia, Faecalibacterium, Blautia, and Lachnospira (Fig. 6d). For fungus-virus associations, Enterobacteriaceae are positively associated with the fungal taxa Aspergillus, Penicillium and Saccharomyces, and negatively associated with Dipodascus (Fig. 6e). These results are in accord with the study by Haak et al. (2021), which used MOFA for factor analysis.
To summarize, HONMF facilitates the identification of discriminative features in multi-omics microbiome data by inspecting LS scores. The association analyses for bacteria, fungi, and viruses provide further indications that the gut trans-kingdom features identified by HONMF are biologically meaningful.

Discussion and conclusions
The accumulation of multi-omics microbiome data provides an unprecedented opportunity to understand the diversity of bacterial, fungal, and viral components. Here, we proposed HONMF, which integrates different composition profiles in multi-modal microbiome data. Network fusion-based methods (including SNF and WSNF) typically assume that a consensus similarity network is shared across different modalities. Unlike these approaches, HONMF assumes that each composition profile has specific latent variables, and merges different sets of latent variables with graph fusion strategy (the second term in the objective function of HONMF). We conduct experiments on three multi-modal microbiome datasets. The results demonstrate that HONMF has better clustering performance and visualization, through relaxing the consensus similarity network assumption, which introduces more flexibility and better deals with the distinct characteristics in various composition profiles. HONMF also takes advantage of hypergraph learning to encode high-order geometrical structures in original data. Compared to simple graph, hypergraph leads to improved clustering qualities (Supplementary Table S3). In addition, HONMF facilitates downstream biological analysis, including microbial signature selection and cross-kingdom association analysis of gut microbiome.
In the experiments, the number of clusters k is set to the one provided in the original publication: $k = 4$ for the gut dataset; $k = 3$ for the sputum and soil data. We also tested the robustness of HONMF to the number of factors (the dimension of the latent features in H), where we varied the number of factors from 2 to 5. The clustering performance evaluated by NMI, ARI, and silhouette score is presented in Additional file 1: Supplementary Table S5. For the sputum dataset, the performance is robust to the number of factors. The gut and soil datasets seem less robust. One possible reason is that the gut and soil microbiome data tend to have a high level of noise and more complicated compositions.
For hyperparameter selection, two rules are designed to assign initial values to the graph regularization parameters $\alpha$ and $\gamma$.
To validate the effectiveness of the objective function, we implemented several simplified versions of HONMF, where we set $\alpha$, $\gamma$ and $\eta$ equal to 0 in turn. The experimental results show that HONMF achieves consistently good performance (Additional file 1: Supplementary Table S2). Moreover, we also ran comparison tests in which a simple graph Laplacian is substituted for the hypergraph Laplacian. In most cases, the performance of GONMF was not as good as that of HONMF (Additional file 1: Supplementary Table S3).