GDmicro: classifying host disease status with GCN and deep adaptation network based on the human gut microbiome data

Abstract
Motivation: With advances in metagenomic sequencing technologies, accumulating studies have revealed associations between the human gut microbiome and human diseases. These associations shed light on using gut microbiome data to distinguish case and control samples of a specific disease, a task also called host disease status classification. Importantly, learning-based models that distinguish disease and control samples are expected to identify important biomarkers more accurately than abundance-based statistical analysis. However, available tools have not fully addressed two challenges of this task: limited labeled microbiome data and decreased accuracy in cross-study settings. Confounding factors, such as diet and technical biases in sample collection/sequencing across different studies/cohorts, often jeopardize the generalization of the learning model.
Results: To address these challenges, we developed a new tool, GDmicro, which combines semi-supervised learning and domain adaptation to achieve a more generalized model using limited labeled samples. We evaluated GDmicro on human gut microbiome data from 11 cohorts covering 5 different diseases. The results show that GDmicro has better performance and robustness than state-of-the-art tools. In particular, it improves the AUC from 0.783 to 0.949 in identifying inflammatory bowel disease. Furthermore, GDmicro can identify potential biomarkers more accurately than abundance-based statistical analysis methods, and it reveals the contribution of these biomarkers to the host's disease status.
Availability and implementation: https://github.com/liaoherui/GDmicro

where ϕ denotes the hidden layers of the network, H_k is the reproducing kernel Hilbert space endowed with a characteristic kernel k, and µ_k(p) is the mean embedding of distribution p in H_k. The squared MK-MMD between the source distribution p and the target distribution q is d_k^2(p, q) = ‖µ_k(p) − µ_k(q)‖^2 in H_k. The main purpose of MK-MMD is to minimize d_k^2 so that the two domain distributions p and q become closer; when d_k^2 = 0, p equals q. Thus, MK-MMD reflects the discrepancy between the source and target domains. As a result, by minimizing the loss function with the MK-MMD-based adaptation regularizer, the model can learn transferable latent features between data from different domains.
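To make the regularizer concrete, the sketch below estimates a squared multi-kernel MMD from two finite samples by averaging RBF kernels over several bandwidths. This is a generic illustration with assumed bandwidths, not GDmicro's exact implementation.

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    # RBF kernel value between two feature vectors
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def mk_mmd2(source, target, sigmas=(0.5, 1.0, 2.0)):
    """Biased estimate of the squared multi-kernel MMD between two samples.
    Averages RBF kernels at several bandwidths (the 'multiple kernels')."""
    def mean_k(a, b):
        return np.mean([gaussian_kernel(x, y, s)
                        for s in sigmas for x in a for y in b])
    # ||mu_k(p) - mu_k(q)||^2 expanded into three empirical kernel means
    return mean_k(source, source) + mean_k(target, target) - 2 * mean_k(source, target)
```

When the two samples come from the same distribution the estimate is close to zero; a distribution shift between domains drives it up, which is exactly what the adaptation regularizer penalizes.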

Analyzing the influence of test sample size on GDmicro's performance
In this experiment, we investigated the influence of test sample size on the model's performance. To achieve this, we ran GDmicro on test sets of different sizes: 1 (named "single"), 3, 5, half of all samples (named "half"), and all samples (named "batch"). For a test sample size n between single and batch, considering all combinations can lead to a tedious setup and long running time. Thus, we randomly selected n samples from all samples and repeated this process 50 times. Then, we report the average AUC over these selections.
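The subsampling protocol above can be sketched as follows. The AUC is computed here via the Mann-Whitney rank statistic, and subsets containing only one class are redrawn since AUC is undefined for them; the function names are illustrative, not GDmicro's API.

```python
import numpy as np

def auc(labels, scores):
    # Mann-Whitney formulation of AUC: P(random positive outranks random negative)
    pos, neg = scores[labels == 1], scores[labels == 0]
    gt = np.sum(pos[:, None] > neg[None, :])
    eq = np.sum(pos[:, None] == neg[None, :])
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def average_subsample_auc(scores, labels, n, repeats=50, seed=0):
    """Average AUC over `repeats` random test subsets of size n."""
    rng = np.random.default_rng(seed)
    aucs = []
    while len(aucs) < repeats:
        idx = rng.choice(len(labels), size=n, replace=False)
        if labels[idx].min() == labels[idx].max():
            continue  # AUC is undefined when the subset has one class; redraw
        aucs.append(auc(labels[idx], scores[idx]))
    return float(np.mean(aucs))
```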
We first explored the influence of test sample size in the 10-fold cross-validation experiment. As shown in Supplementary Figure S1, GDmicro's performance consistently improves as the number of input test samples increases across all tested cohorts. In particular, the improvement was significant when the number of test samples increased from 5 to half and from half to all for most tested cohorts. However, the performance on the CRC-FR and IBD-DK cohorts does not exhibit significant improvement when increasing the number of test samples from 5 to half, in contrast to the improvement observed when increasing from half to all. This discrepancy could be attributed to the limited total number of test samples, which results in similar test sample sizes between the 5 and half groups for these two cohorts.
Then, we further investigated the influence of test sample size in the cross-study experiment. As shown in Supplementary Figure S2, GDmicro's performance still improves with an increase in the number of test samples. On the CRC-DE and CRC-AT cohorts, we also notice that the performance converges rapidly once the number of test samples increases to three. In contrast, the performance on the remaining cohorts displays varying changes but exhibits an overall improvement. These experiments show the significance of test sample size in enhancing the performance of GDmicro and highlight the advantages of increasing cohort size for achieving better results.

Ablation study and parameter analysis
In this experiment, we study how different architectures and parameters influence the performance of GDmicro using an ablation study and parameter analysis. Specifically, we analyzed the influence of the adaptation loss function, the GCN model, and the hyper-parameter k in the kNN graph on the performance of GDmicro. To be more consistent with real-world usage, we analyzed the datasets of the cross-study experiment.
As discussed in the Methods section, the loss function used in the deep adaptation network is based on the multi-kernel variant of maximum mean discrepancy (MK-MMD), which aims to reduce the domain discrepancy between data from different studies. To show the effect of different loss functions on the performance of GDmicro, we repeated the analysis of the cross-study experiment with and without the MK-MMD-based loss. When the loss function contains only the cross-entropy loss, the model is a multi-layer fully connected network (aka multi-layer perceptron, or MLP) that ignores the domain discrepancy. In addition, to determine whether the GCN model improved the classification performance, we also applied the deep adaptation network and the MLP on their own for host disease status classification in the cross-study experiment.
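Assuming a standard domain-adaptation formulation, the overall objective adds the MK-MMD regularizer to the supervised cross-entropy loss via a trade-off weight (here called `lam`, an illustrative name); dropping the second term recovers the plain MLP objective described above.

```python
import numpy as np

def cross_entropy(probs, labels):
    # mean negative log-probability assigned to the true class
    return -float(np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def adaptation_objective(probs, labels, mmd2, lam=1.0):
    """Sketch of the combined loss: classification loss + lam * domain discrepancy."""
    return cross_entropy(probs, labels) + lam * mmd2
```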
As shown in Supplementary Figure S3A, GDmicro with the default architecture achieved better performance than the model without domain adaptation in five out of seven tested datasets. This result indicates that the MK-MMD-based loss improves the model's robustness by learning transferable latent features. Relatedly, the deep adaptation network outperformed the MLP in all tested cohorts, demonstrating that the deep adaptation network improves classification robustness by minimizing the domain discrepancy. We also noticed that GDmicro with the default architecture achieved better performance than the deep adaptation network alone and the MLP, demonstrating that the GCN model improves the classification AUC by incorporating structural and compositional abundance features and utilizing information from unlabeled samples.
The hyper-parameter k is an important parameter for the kNN graph, as it determines the graph's topological structure. Thus, we investigated the performance of GDmicro under different k by repeating the analysis of the cross-study experiment with k ∈ {3, 5, 7, 10}. As shown in Supplementary Figure S3B, the performance of GDmicro does not fluctuate much across all tested datasets as k changes, which indicates that GDmicro is not very sensitive to k. By default, we use k = 5 to construct the kNN graph.
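A minimal sketch of kNN-graph construction over sample feature profiles, assuming Euclidean distance (GDmicro's actual similarity measure may differ): each sample links to its k nearest neighbours, and the adjacency matrix is then symmetrized.

```python
import numpy as np

def knn_graph(features, k=5):
    """Symmetric kNN adjacency matrix from pairwise Euclidean distances."""
    n = len(features)
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-loops from the neighbour search
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        nbrs = np.argsort(d[i])[:k]      # indices of the k nearest neighbours of sample i
        adj[i, nbrs] = 1
    return np.maximum(adj, adj.T)        # symmetrize: keep an edge if either side picks the other
```

Symmetrization means every node ends up with at least k neighbours, so a larger k yields a denser graph over which the GCN propagates information.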

LOSO experiments with top 50 features selected by different methods
To identify biomarkers with the Wilcoxon test, we calculated the p-value of each feature using all the training data. In this experiment, each p-value is given a sign: a positive sign signifies that the feature is enriched in disease samples, whereas a negative sign indicates enrichment in healthy samples. Subsequently, all features are sorted in ascending order of the absolute value of their signed p-values. To avoid data bias, we repeated the LOSO experiment with the top 50 features identified by GDmicro and by the statistics-based method. The average AUC for GDmicro is 0.891, while that for the statistics-based method is 0.869 (Supplementary Figure S6), consistent with the result observed for the top 10 features.
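The signed-p-value ranking can be sketched as below, using SciPy's Wilcoxon rank-sum test; the sign convention (positive for disease-enriched features) and the mean-based direction check are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy.stats import ranksums

def rank_features(case, ctrl):
    """Rank features by ascending absolute Wilcoxon rank-sum p-value.
    The sign records the enrichment direction: + for disease, - for healthy."""
    signed = []
    for j in range(case.shape[1]):
        p = ranksums(case[:, j], ctrl[:, j]).pvalue
        sign = 1.0 if case[:, j].mean() > ctrl[:, j].mean() else -1.0
        signed.append(sign * p)
    signed = np.array(signed)
    order = np.argsort(np.abs(signed))   # most significant features first
    return order, signed
```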