Deciphering associations between gut microbiota and clinical factors using microbial modules

Abstract Motivation Human gut microbiota plays a vital role in maintaining body health. The dysbiosis of gut microbiota is associated with a variety of diseases. It is critical to uncover the associations between gut microbiota and disease states as well as other intrinsic or environmental factors. However, inferring alterations of individual microbial taxa based on relative abundance data likely leads to false associations and conflicting discoveries in different studies. Moreover, the effects of underlying factors and microbe–microbe interactions could lead to the alteration of larger sets of taxa. It might be more robust to investigate gut microbiota using groups of related taxa instead of the composition of individual taxa. Results We proposed a novel method to identify underlying microbial modules, i.e. groups of taxa with similar abundance patterns affected by a common latent factor, from longitudinal gut microbiota and applied it to inflammatory bowel disease (IBD). The identified modules demonstrated closer intragroup relationships, indicating potential microbe–microbe interactions and influences of underlying factors. Associations between the modules and several clinical factors were investigated, especially disease states. The IBD-associated modules performed better in stratifying the subjects compared with the relative abundance of individual taxa. The modules were further validated in external cohorts, demonstrating the efficacy of the proposed method in identifying general and robust microbial modules. The study reveals the benefit of considering the ecological effects in gut microbiota analysis and the great promise of linking clinical factors with underlying microbial modules. Availability and implementation https://github.com/rwang-z/microbial_module.git.


Supplementary Table S2. Differential taxa between the IBD and control groups identified by DA analysis by LMM
We used Variational Bayes (VB) (Jordan et al., 1999) to estimate the posterior distributions due to the large size of the relative abundance tensor.VB provides approximated results but is much faster than the alternative method Markov Chain Monte Carlo (MCMC).VB aims to find the setting of the parameters of a selected family of distributions (Θ) that is closest to the posterior distribution by minimizing the Kullback-Leibler (KL) divergence between (Θ) and (| ̃).The distribution of (θ) is the form of In each iteration, the algorithm updates the variables one by one and calculates the ELBO.In addition, 〈〉 is calculated to check the convergence.The algorithm will continue to update the variables until the average changes of the modules drop to less than 1 taxa in the last ten iterations (Hore et al., 2016).The maximum number of iterations was set to 2000.
After posterior estimation, the members of a module were determined by the posterior inclusion probabilities (PIP), calculated as   (  ), which indicated the probability that a variable was included in the true model (Hore et al., 2016).A taxon was determined to be a member of a module if the corresponding  > 0.5.The maximum number of modules   in the experiments was set to 300, because some preliminary results demonstrated that 300 was large enough for the underlying modules.Distinct initializations of the variables would lead to different factorization results.To find a better result and to conduct a fair evaluation of the overall performance of the model, 30 runs with random initializations were executed, except   ,   and   , which were initialized to 0.5.The result of each run was evaluated by the subject classification performance using the top-ranking modules with associations to IBD sorted by the Wilcoxon rank-sum test, where the modules with the lowest adjusted p-values (Benjamini-Hochberg method (Benjamini and Hochberg, 1995)) were selected.

Visit-uncorrelated model
For the visit-uncorrelated model, the activities of the modules in the samples were assumed to be independent  .~ (0,   ).It was extended to be applicable to both complete and incomplete tensors by incorporating an indicator matrix  to mark the missing data: where  = { ̃, }.̂∈ ℝ 97×123×24 was the incomplete tensor constructed from the "24-visitset" where the relative abundance of the samples from each subject was placed in the cube according to their collected time points, leaving the unfilled entries as missing data.  = 1 if the sample of subject  at time point  was collected in ; otherwise,   = 0.The updates with respect to  are as follows: The updates for the other variables and the ELBO were also modified by adding the indicator matrix to summations over  and  in the same way as   .The part regarding the factor matrix  in ELBO (the fourth and fifth lines) was replaced by the following expression: The visit-uncorrelated model was applied to both "10-visit-set" ( ̃) and "24-visit-set" () and was inferred and evaluated in the same way as the visit-correlated model.

Relationships between the microbial taxa in the modules
To study the relationships between the microbial taxa in the same module, the intra-module co-occurrence and functional similarity were investigated, where the co-occurrence was measured by Dice Index (DI) and Spearman correlation, and the functional similarity was measured based on the contributed pathway abundance.Specifically, for each module, the members were identified by the PIP described above.For each pair of taxa (,  ′ ) in the module, the pairwise DI and absolute Spearman correlation were computed (1) individually by ) or Spearman correlation between  and  ′ , and () is the flattened vector of .For the former, the co-occurrence was first calculated across the samples for each subject and averaged over all subjects afterwards to indicate the co-occurrence of the taxa over time.The latter measured the co-occurrence across the samples from all subjects regardless of the effects of subjects and sampling time, taking each sample as an independent community.Then, the overall DI and Spearman correlation for the module were calculated by averaging the above pairwise measures within the module, respectively.The functional similarity of a pair of taxa in a particular sample was evaluated by the Jaccard similarity of their contributions to the pathways based on the functional profile.The intra-module functional similarity was computed by averaging the pairwise similarity of member taxa over all samples.When demonstrated in the figures, all of these measures were first multiplied by a scaling factor of 100 and then log-transformed.
In addition, the co-occurrence and functional similarity of the identified microbial modules were compared to that of randomly generated microbial groups.The pairs of taxa not included in any modules were selected as the candidates to form the random groups.For each identified module, ten equal-size groups (groups consisting of an equal number of taxa pairs as the module) were generated by randomly selecting taxa pairs from the candidates.The intra-group DI was calculated for each of the ten groups in the same way as the modules and then averaged as the final comparison to the DI of the module.The Spearman correlation and functional similarity of the random groups were measured similarly.

Permutational multivariate analysis of variance (PERMANOVA)
The variation of taxonomic composition and the microbial module activities explained by several clinical factors, including disease states, medication, antibiotics, immunosuppressant, chemotherapy, diarrhea, bowel surgery, age, sex and race, were quantified via PERMANOVA using the adonis function in the R package "Vegan" (Oksanen et al., 2020), respectively.The distances between the taxonomic composition of the samples were measured by Bray-Curtis dissimilarity.To address the issue of repeated measures of longitudinal data, restricted permutations were used in PERMANOVA.For the sample-level factors, including medication, antibiotics, immunosuppressant, diarrhea and bowel surgery, which recorded the status of the subjects in the week of sample collection, blocked permutations within subjects were adopted by using the "strata" option of the adonis function.For the subject-level factors, including disease states, age, sex and race, which were constant across the samples of a subject, permutations of plots (i.e., groups of samples collected from the same subject) were performed.The subject "C3031" was removed from the analysis due to the missing age information.
Regular PERMANOVA was conducted to evaluate the variation of microbial modules explained by the factors at the subject-level, because the activities of the microbial modules were estimated with respect to each subject without the effects of repeated measures.For each sample-level factor, an overall value was required for each subject.It was determined by merging the values of the factor across the samples from the subject.Since the sample-level factors were binary valued, the overall value of a sample-level factor for a subject was labelled as positive ("Yes") if there was at least one sample demonstrated positive for the factor; otherwise, it was labelled as negative (or "No").For example, if a subject was recorded using antibiotics (positive) in at least one sample, the overall value of the factor for the subject was marked positive.Euclidean distance between the activity vectors of the modules, i.e., the rows of the factor matrix , was used to measure the distances between the subjects.
Differentially abundant microbial taxa in terms of disease states (IBD and control) were identified from the log-transformed relative abundance profile consisting of all samples in the "10-visit-set".The linear mixed-effects model (LMM) was used to account for the effects of repeated measures (Lloyd-Price et al., 2019):

𝑅𝑒𝑙𝑎𝑡𝑖𝑣𝑒 𝑎𝑏𝑢𝑛𝑑𝑎𝑛𝑐𝑒 ~ (𝑖𝑛𝑡𝑒𝑟𝑐𝑒𝑝𝑡) + 𝑑𝑖𝑠𝑒𝑎𝑠𝑒 𝑠𝑡𝑎𝑡𝑒𝑠 + 𝑎𝑔𝑒 + 𝑎𝑛𝑡𝑖𝑏𝑖𝑜𝑡𝑖𝑐𝑠 + 1|𝑠𝑢𝑏𝑗𝑒𝑐𝑡
where the relative abundance of each taxon was modelled as a regression of the intercept, disease states, age and antibiotics use with subjects as a random variable.It was fitted using the lme function in the R package "nlme" (DebRoy, 2006).The estimated coefficient of disease states with respect to each taxon was used to indicate the association between them.Differentially abundant taxa were then identified with a target FDR of 0.25 (Lloyd-Price et al., 2019).
HMDAD is a resource that integrates the microbe-disease associations discovered in previous studies.Microbial associations in HMDAD were downloaded from http://www.cuilab.cn/hmdad.The associations of IBD, UC and CD at species level were selected, where the associated species not included in the discovery set were removed.In addition, we also collected the differentially abundant taxa detected in another two studies of IBD (Franzosa et al., 2019;Ma et al., 2021).

Classification of the subjects using relative abundance
In comparison to microbial modules, we also evaluated the classification performance of the relative abundance of the differentially abundant taxa as well as all microbial taxa.RF with 2000 trees was used and tuned by the R package caret (Kuhn, 2015).To thoroughly investigate the classification performance of relative abundance, we trained and tested RF models on the samples, the averaged profile and randomly generated one-visit profiles of the "10-visit-set", respectively.For sample classification, the profile containing all samples was used.The averaged profile was calculated by averaging the samples collected from the same subject, characterizing each subject by the averaged relative abundance of taxa over time.We conducted 10-fold cross-validation and calculated the average AUROC over the 10 folds for the all-sample profile and the averaged profile, respectively.For the evaluation of one-visit profiles, we generated ten one-visit profiles where each one was created by randomly selecting one sample from each subject.We calculated the average AUROC of 10-fold cross-validation for each one-visit profile and then evaluated the overall performance by averaging AUROC over the 10 profiles.The corresponding AUPR, sensitivity, specificity and precision were calculated similarly.

Supplementary
Fig. S3.Relative abundance (log-transformed) of the most active taxa in the modules associated with antibiotics (A), immunosuppressants (B) and diarrhea (C)."Yes" and "No" indicate the distinct groups of clinical factors.Supplementary Fig. S5.The intra-module Dice Index (A), absolute Spearman correlation (B) and functional similarity (C) of the IBD-associated modules compared to the equal-size random groups.Illustration of pairwise correlation coefficients of taxa evaluated by SparCC. A. Heatmap demonstrating the correlation of each taxa pair.Taxa pairs in the IBD-associated modules are distributed in the green rectangle.B. Boxplot of correlation comparison of taxa pairs in the IBD-associated modules, other microbial modules and those not included in any module.The correlations are log-transformed after multiplying a scaling factor of 100.

+
Pairs in the IBD-associated modules -Pairs in the other modules The intra-module Dice Index and absolute Spearman correlation of the IBD-associated modules in the validation cohorts.