MicrobiomeAnalyst 2.0: comprehensive statistical, functional and integrative analysis of microbiome data

Abstract Microbiome studies have become routine in biomedical, agricultural and environmental sciences with diverse aims, including diversity profiling, functional characterization, and translational applications. The resulting complex, often multi-omics datasets demand powerful, yet user-friendly bioinformatics tools to reveal key patterns, important biomarkers, and potential activities. Here we introduce MicrobiomeAnalyst 2.0 to support comprehensive statistics, visualization, functional interpretation, and integrative analysis of data outputs commonly generated from microbiome studies. Compared to the previous version, MicrobiomeAnalyst 2.0 features three new modules: (i) a Raw Data Processing module for amplicon data processing and taxonomy annotation that connects directly with the Marker Data Profiling module for downstream statistical analysis; (ii) a Microbiome Metabolomics Profiling module to help dissect associations between community compositions and metabolic activities through joint analysis of paired microbiome and metabolomics datasets; and (iii) a Statistical Meta-Analysis module to help identify consistent signatures by integrating datasets across multiple studies. Other important improvements include added support for multi-factor differential analysis and interactive visualizations for popular graphical outputs, updated methods for functional prediction and correlation analysis, and expanded taxon set libraries based on the latest literature. These new features are demonstrated using a multi-omics dataset from a recent type 1 diabetes study. MicrobiomeAnalyst 2.0 is freely available at microbiomeanalyst.ca.


INTRODUCTION
Over the past decade, microbiome studies have experienced tremendous gr owth acr oss di v erse disciplines with a clear trend towar ds le v eraging multiple omics technologies for comprehensi v e characterization of the underlying communities ( 1 , 2 ). The microbiome is now considered a key player in human health and sustainable agriculture (3)(4)(5)(6). Powerful bioinformatics pipelines and tools have been continuously de v eloped and updated to help anal yze increasingl y complex datasets (7)(8)(9)(10). Version 1.0 of MicrobiomeAnalyst was de v eloped to provide a user-friendly w e b-based platf orm f or bench r esear chers to perform compr ehensi v e exploratory analysis on common abundance profiles and taxonomic signatures ( 11 ). Since its release in 2017, Mi-crobiomeAnalyst has been continuously updated based on user feedback, with a detailed analysis protocol published in 2020 ( 12 ). Based on Google Analytics, the MicrobiomeAn- alyst public server has processed > 125 000 jobs submitted from > 30 000 users worldwide during the past 12 months.
Microbiome data analysis is conceptually similar to other omics data analysis workflows, consisting of three typical stages: raw da ta processing, sta tistical analysis, and functional interpretation. In practice, howe v er, microbiome data shows much higher heterogeneity with particularly strong inter -individual and inter -popula tion dif ferences, causing statistical issues including zero inflation, compositionality and ov er dispersion (13)(14)(15). These characteristics hav e motivated the development of a wide array of analysis methods, resulting in a landscape challenging for researchers who are not experts in statistics or programming. The marker gene data analysis has seen a shift from the traditional operational tax onom y units (O TUs), w hich ar e clusters of r eads based on similarity thresholds, towards high-resolution amplicon sequence variants (ASVs) identified based on their unique biological sequences ( 16 ). Using ASVs not only reduces the computational bottleneck associated with sequence clustering but also facilita tes compara ti v e analysis across different studies. In differential abundance analysis, se v eral benchmar k studies hav e shown inconsistencies among methods de v eloped specifically for microbiome da ta, and tha t common RN Aseq anal ysis methods are robust and perform well ( 17 , 18 ). Finally, there is a growing demand for easy-to-use yet fle xib le tools that can account for complex metadata as well as to support multi-omics integration for microbiome studies (19)(20)(21).
To keep up with the progress and the evolving data analysis needs arising from recent microbiome studies, we have made significant updates to the MicrobiomeAnalyst platform, including three new modules: (i) a raw data processing module for marker gene data that links dir ectly to downstr eam statistical analysis; (ii) a microbiome metabolomics module for analysis of paired microbiome and metabolomics data, and (iii) a statistical meta-analysis module for multiple marker gene datasets. We have also made significant updates to the previous modules including support for complex metadata (metadata editor, continuous metadata, and multi-factor com-parison analysis), enhanced statistical approaches (functional prediction and correlation network analysis), new interacti v e visualizations (stacked bar plot, heatmaps and a KEGG metabolic network), and expanded taxon set libraries based on the latest literature. MicrobiomeAnalyst 2.0 is available freely at microbiomeanalyst.ca. It contains comprehensi v e tutorials, equipped with a dedicated user f orum (omicsf orum.ca). The underlying MicrobiomeAna-lystR package is also released ( https://github.com/xia-lab/ MicrobiomeAnalystR ) to facilitate transparent and reproducible analysis.

PROGRAM DESCRIPTION AND METHODS
The workflow of MicrobiomeAnalyst 2.0 consists of four main steps ( Figure 1 ). It supports common input types including raw amplicon sequencing data for 16S, 18S rRNA genes or internal transcribed spacer (ITS) region, a single count table generated from maker gene or shotgun metagenomics, paired microbiome and metabolomic data tables or lists, multiple maker gene count tables from compatible studies, or taxonomic signatures. After upload, all input data follows the same general workflow of data processing, method selection, and result exploration. Comprehensi v e options and analysis support are available at each step. In the following sections, we will focus primarily on the new or improved features introduced in version 2.0.

Amplicon sequencing data processing
High-throughput amplicon sequencing has yielded many insights into the de v elopment and progression of human diseases ( 3 ). It has become a ubiquitous method to study the complexity and diversity of microbiomes. Compared to shotgun metagenomics sequencing, the marker gene survey is both cost effective and computationally efficient, especiall y for highl y hetero geneous comm unities with many low-abundant species. Raw reads need to be first processed into OTUs or ASVs before downstream analysis. Se v eral tools have been developed for raw data processing including QIIME2 ( 22 ), Mothur ( 23 ) and D AD A2 ( 24 ). Howe v er, command line skills are required to use these tools. Micr obiomeAnalyst 2.0 intr oduces a new module with an automated pipeline based on the well-established D AD A2 workflow for processing amplicon sequencing data.
To start raw data processing, users can upload either single or paired-end compressed FASTQ files (.gz or .zip) from 16S / 18S / ITS sequencing. A metadata file in plain text format (.txt or .csv) is also r equir ed for further downstream statistical analysis. The workflow includes filtering, der eplication, sample infer ence, chimera identification, and merging of paired-end reads. MicrobiomeAnalyst 2.0 provides a parameter selection page to allow users to tune processing parameters based on quality control graphical outputs. Tax onom y annotation is based on se v eral r efer ence databases, including SILVA (v138) ( 25 ), Greengenes (13.8) ( 26 ) and RDP (release 11.5) ( 27 ) databases for 16S sequencing, UNITE database ( 28 ) for ITS sequencing, and SILVA (v132) ( 25 ) for 18S sequencing. When raw spectral processing is complete, summary graphics and detailed processing information are generated for individual samples. The resulting ASV and tax onom y tables can be downloaded or directly used as input for marker data profiling by clicking the module r edir ection button.

Integr ative analysis f or data fr om micr obiome metabolomics studies
Metabolites are key players in microbial communications and interactions with their hosts. Metabolomics is increasingly used in recent microbiome studies to connect microbial community compositions and phenotypes at the le v el of altered metabolic processes ( 1 , 2 , 29 ). Howe v er, integrating high-dimensional microbiome and metabolomics data remains a major challenge. To address this gap, Microbiome-Analyst 2.0 introduces a new module to allow users to explor e r elationships between the micr obiome pr ofiles and their metabolic products.
Users can upload either paired abundance tables or paired lists. For microbiome data, the input features can be OTUs, ASVs or KEGG Orthologs (KOs). For metabolomics data, the input features can be metabolites (targeted metabolomics) or LC-MS peaks (untargeted meta bolomics). For ta ble inputs, different data filtering and normalization methods are provided based on the input data types. MaAsLin2 ( 19 ) and limma ( 30 ) are employed for the statistical comparisons of microbiome and metabolomics data, respecti v ely. Both methods rely on general linear models to determine the associations between omics features and complex metadata, with support for covariate adjustments. List inputs are directly submitted to the name mapping step to pr epar e for the further integration analysis. Three strategies have been implemented for microbiome-metabolome integration --dimensionality reduction, metabolic network analysis, and correlation analysis.
Dimensionality reduction. Two robust dimensionality reduction methods, Procrustes analysis (PA) ( 31 ) and data integration analysis for biomarker discovery using latent components (DIABLO) ( 32 ), have been implemented to re v eal ov erall patterns between paired microbiome and metabolomics datasets. PA is an unsupervised method that superimposes the principal components of two datasets by rotating the axes of one dataset until the maximum similarity is achie v ed. DIABLO is a supervised method that aims to identify multi-omics components that maximally explain the variances of individual data and their covariance together with the metadata of inter est. The corr esponding r esults ar e pr esented in an interacti v e 3D scatter plot. Users can switch between score plots, loading plots, and biplots to visualize high-le v el trends, highlight results with different metadata, or identify features of interest.
Metabolic network analysis. This module aims to offer metabolic analysis contextualized based on the taxa or KOs present in the uploaded microbiome profiles. Users can customize the global metabolic networks based on statistically significant taxa or all taxa detected in the microbiome da ta. Alterna ti v ely, users can choose the generic (unfiltered) metabolic background based on the aggregated microbial metabolic network, or its combination with the host metabolic netw ork. Tw o well-established algorithms -mummichog ( 33 ) and globaltest ( 34 ) are used to perform enrichment analysis for LC-MS peaks and other featur es, r especti v ely. The r esults ar e visualized in an interacti v e global metabolic network, in which nodes r epr esent metabolites, edges r epr esent enzymatic r eactions, and r eactions that fall outside of the study-specific microbial potential or KO profiles are greyed out. Users can click any enriched pathway names in the table to highlight the corresponding metabolites or KOs on the network. User can also directly click a node (metabolite) in the network to view the most associated microbes displayed as a circle plot.
Micr obiome-metabolome corr elation analysis . This module supports statistical, model-based and integrated correlation analyses. For statistical correlation analysis, the default option is the distance-based correlation method which can detect both linear and non-linear correlations ( 35 ). Other options include Pearson, Kendall, and Spearman correlations and their corresponding partial correlations. The results are summarized as an interacti v e heatmap. Pairwise correlation analysis often leads to a high number of false positi v es, making biological interpretation difficult. To address this issue, we implemented a model-based correlation based on > 5000 high-quality genome-scale metabolic models (GEMs) to provide a probability heatmap between microbial taxa and their metabolites ( 36 ). Finally, users can choose to overlay the statistical and the model-based correla tion hea tmaps to integra te da ta-dri v en and knowledgedri v en streams of evidence.

Statistical meta-analysis across multiple microbiome studies
It is notoriously challenging to achie v e reproducib le featur es across differ ent microbiome studies due to the variations in experimental design, analysis methods and quantitati v e assessment ( 37 , 38 ). The statistical meta-analysis module aims to provide a frame wor k for integrating data Nucleic Acids Research, 2023, Vol. 51, Web Server issue W313 from multiple maker gene studies of the same phenotypes to help identify robust and reproducible features.
The data upload and processing steps are similar to the single marker data profiling workflow, with an additional verification step to ensure that all datasets and metadata are consistent. After processing, batch correction is performed to adjust for potential technical variations to increase the comparability of different microbiome studies ( 13 ). After this step, three meta-analysis strategies are offered -visual e xploration, biomar ker meta-analysis, and di v ersity metaanalysis.
V isual explor ation. This appr oach pr o vides stack ed area / bar plot and principal coordinate analysis (PCoA) plot to gi v e an ov ervie w of high-le v el patterns, while still allowing users to investigate sample-level details. Stacked area / bar plot offers a sample-le v el profiling of taxa abundance across all datasets to better understand taxonomic composition, while PCoA provides an ov ervie w of the similarities / dissimilarities in microbial composition between samples and datasets. Please note that the previous 'Projection to Public Data' module has been migrated to this page.
Biomar k er meta-anal ysis. The objecti v e of this approach is to integrate the results from differential abundance testing of individual datasets to identify common microbial signatures associated with phenotype(s) of interest. The method is composed of two parts: abundance testing in individual da tasets using multivaria te linear r egr ession followed by the integration of effect size using a random effects model based on the MMUPHin R package ( 13 ). The results are presented in the form of a bar plot displaying the top significant features along with a detailed table containing the statistical summaries of all features across individual studies.
Diversity meta-anal ysis. The a pproach integrates alpha and beta di v ersity indices across datasets. Common alpha di v ersity indices are computed for each study, and users can view ratios of indices between experimental groups using box plots and forest plots. Beta di v ersity indices are integrated by performing PCoA on common distance matrices from each study. Multiple statistical tests such as PERMANOVA ( 39 ), ANOSIM ( 40 ), PERMDISP ( 40 ) and MiRKAT ( 41 ) are available to measure significances on the effect of phenotype on community composition. Both graphical summaries and detailed tables are provided for alpha and beta di v ersity meta-analysis.

Other features
Multi-factor anal ysis f or comple x metadata. Microbiome datasets continue to increase in size with more complex experimental designs, and ther efor e mor e complex metadata. In addition, complex metadata are especially important for observational studies, where both continuous and categorical covariates are often measured. Therefore, we have invested significant effort to enhance metadata support in Mi-crobiomeAnalyst 2.0. A metadata panel was implemented on the data integrity check page for users to inspect and edit metadata variables, including specifying whether they are continuous or categorical. Users can also specify the order of group labels for ca tegorical metada ta. A multi-factor comparison tool based on general linear models was implemented using the MaAsLin2 R package ( 19 ). Users specify their primary metadata of interest, and can include covariates such as age, sex or technical factors to adjust for. Covariates can be modelled as either fixed or random effects. A linear model containing the primary metadata and all covariates are fit to each feature, and then statistics are extracted from the model for the primary metadata.
Impr oved corr elation analysis and function pr ediction. Several functions for marker gene profiling have been updated based on recent de v elopments in the field. MicrobiomeAnalyst 2.0 now offers se v en correlation methods for users to explore microbial relationships, including the recent Sparse Estima tion of Correla tions among Microbiomes (SECOM) method which provides measures of both linear and nonlinear relationships between microbes ( 10 ). For prediction of functional capacities from 16S rRNA gene abundance table, the previous version offered PICRUSt and Tax4Fun based on GreenGenes and SILVA tax onom y annotation, respecti v ely. In v ersion 2.0, we hav e upda ted the da tabase for PI-CRUSt to support annotation of > 200 000 OTUs against ∼7000 KOs. Tax4Fun2 is also available to allow users to predict potential functions directly from ASV sequences.
Enhanced visualizations for lar g e data explor ation. We implemented interacti v e plots for stacked bar / area plots and clustering heatmaps -those features are among the most fr equent r equests from our users for visual exploration of large datasets. Both mouse-over and zoom-in effects are supported to allow users to get details of the fea tures / pa tterns of interest. Another improvement is the updated KEGG metabolic network (Release 105.0) for improved visualization and functional analysis.
Expanded taxon set libr aries . The Taxon Set Enrichment Analysis (TSEA) module was created to allow r esear chers to identify taxonomic signatures characterized by their shared functions or associations with specific phenotypes to facilita te da ta interpreta tion and hypothesis genera tion. TSEA performs hypergeometric tests against a taxon set library of interest to detect the most frequently represented signatures from an input list of microbial features. In version 2.0, we have integra ted da ta from popular databases such as gutMDisorder ( 42 ), GIMICA ( 43 ) and MiMeDB ( 44 ), and expanded the list of phenotypic features to include 102 microbiome fea tures associa ted with immune responses , 77 microbiome-metabolite associations , 55 taxon sets associated with cancer, and 137 taxon sets associated with drug treatments. To improve the statistical power and biological relevance, we further consolidated taxon sets with at least four or more microbial members. The taxon set libraries now contain a total of 611 host-intrinsic features, 696 host-extrinsic features associated with diet, medication, and lifestyle, 500 associated with environmental features, and > 700 single-nucleotide polymorphism (SNP) associated taxon sets.

Case study
To showcase the new features in MicrobiomeAnalyst 2.0, we le v eraged a recent study on type 1 diabetes (T1D) ( 29 ). T1D is an autoimmune disorder that induces beta cell destruction and insulin deficiency ( 45 ). Previous studies showed that multiple factors can cause T1D such as genetic susceptibility, viral infections, dietary components, as well as gut microbiome ( 46 ). The objecti v e of the study was to investigate the impact of altered microbial communities in people with and without T1D. Both 16S marker gene sequencing and LC-MS-based metabolomics were performed. Using the accession numbers provided in the original paper, we downloaded raw sequencing data from the NCBI Sequence Read Archi v e (SRA) database, and the metabolite concentration table from the MetaboLights ( 47 ). Raw data processing was performed using our D AD A2 pipeline to get ASV a bundance ta bles and tax onom y annotations. The result was submitted for functional profiling based on the prediction by Tax4Fun2 ( 7 ). Two types of co-analysis were then conducted by integrating metabolite abundance with either the ASV count data or the KO a bundance ta ble. The genus le v el was used as an example to explain the results presented in Figure 2 . Figure 2 A shows the DIABLO biplot result presented in a 3D scatter plot. The composition of T1D and the contr ol gr oups overla p to a certain degree w hich is consistent with the original pub lication. Se v eral microbial taxa, such as Bacteroides and Alistipes, were observed to be associated with the top components. We hypothesize that these microbes dri v e the separation between T1D and non-diabetic subjects through certain metabolites. Detailed microbe-metabolite corr elations ar e pr esented by the overlayed heatmap (Figure 2 B). Only the features with an ad- justed P -value < 0.1 from the comparison analysis were used in this step. The statistical correlation was performed using the distance-based method and the AGORA database was selected for the GEM-based prediction result. With a significance cut-off of 0.05, we can observe that both approaches show Bacteroides significantly associated with glucose , glutamine , and se v eral amino acids. Alistipes also correlated with a different set of amino acids , which is consistent with the pattern found by DIABL O anal ysis . Although Bacteroides was not identified as a biomarker in the original paper, howe v er other studies hav e shown that it is related to diet and is a risk factor for early autoantibody de v elopment ( 48 ). Most studies focused on the compositional change of Bacteroides species in T1D without linking to function. Our analysis shows the metabolites significantly associated with Bacteroides, suggesting it potentially influences T1D through 'Alanine, aspartate and glutamate metabolism'. Figure 2 C shows the combined result of enrichment analysis from metabolites and KOs against the KEGG metabolic pathways using the globaltest method.
Se v eral pathways were detected by both datasets (highlighted in the left panel of Figure 2 C) including 'Vitamin B6 metabolism', 'Tyrosine metabolism', and 'Alanine, aspartate and glutamate metabolism'. The pathways that vary between the T1D and contr ol gr oup can be visualized within the network with different colors for each pathway. Taxa correlated with each metabolite can be visualized by clicking the corresponding node within the network. For example, the deficiency of pyridoxamine may impair insulin signaling ( 49 ). The top 10 most correlated genera such as Alistipes for pyridoxamine are shown in Figure 2 D. We note that the metabolites within 'Vitamin B6 metabolism' were not significantly different between the T1D and the control groups, howe v er the enrichment analysis can still identify the alteration at the pathway le v el.

Implementation
The w e b interface of MicrobiomeAnalyst 2.0 is implemented based on the JavaServer Faces framework using the PrimeFaces library ( https://www.primefaces.org , v12.0.0). The statistical functions and graphics are implemented using R (v4.2.2) and are freely available from the GitHub repositories ( https://github.com/xia-lab/ MicrobiomeAnalystR ). To accommodate the growing user traffic and computing demand, the system is deployed on a Google Cloud instance load balanced with a second computing node hosted at McGill Data Center. For the raw data processing, the job submission and scheduling are based on the Simple Linux Utility for Resource Management (SLURM) system.

Comparison with other tools
Se v eral w e b-based tools hav e been de v eloped for microbiome data analysis. Here we compare MicrobiomeAnalyst 2.0 with four other tools as well as to the previous version. Table 1 summarizes the main features of each tool. Popular tools dedicated to processing and archiving the raw sequence data, such as metagenomics rapid annotations using subsystems technology (MG-RAST) and MGnify (previously known as EBI Metagenomics) are not listed here ( 50 , 51 ). MicrobiomeAnalyst 1.0 ( 11 ) was de v eloped to address the needs for statistical analysis by providing a comprehensi v e list of functions and publica tion-read y graphics. Similar tools include analysis of microbial population structures (VAMPS), Namco and MIAN (52)(53)(54). Howe v er, only the newly built Namco has a comparable number of analysis options as MicrobiomeAnalyst 2.0. Global catalogue of metagenomics (gcMeta) ( 55 ) is designed to annotate and analyze raw data from both marker gene and shotgun metagenomics, and is supported by a large collection of multi-omics studies, howe v er it provides very limited analysis methods and no corresponding approaches for meta-anal ysis. Finall y, integrati v e analysis of microbiome and metabolomics data has addressed an urgent demand by the microbial comm unity. Overall, MicrobiomeAnal yst 2.0 is the most comprehensi v e w e b-based platform to allow user-friendly and streamlined microbiome data analysis and interpretation.

Conclusion
MicrobiomeAnalyst 2.0 has been de v eloped to meet the fast-evolving needs of microbiome data analysis. It provides a w e b-based platf orm f or r esear chers to easily explore and understand their data. To keep up with the latest de v elopments, we hav e updated the libraries for functional annotation, taxon set enrichment analysis and embedded se v eral recent statistical methods to enhance the modules de v eloped in v ersion 1.0. With the three ne w modules introduced in version 2.0, MicrobiomeAnalyst now supports streamlined analysis for marker gene data from raw data processing to downstream statistical and functional analysis. It also enables the integrative analysis for both paired microbiome-metabolomics datasets as well as multiple marker gene count tables. Our case study indicates that MicrobiomeAnalyst 2.0 can distill information from complex datasets to reveal the potential mechanic links between microbes and metabolites associated with T1D. Due to the internet bandwidth and large user traffic, the public server currently limits the maximum file size to 50MB for count tables and 100 raw sequence files per analysis session. We recommend using the MicrobiomeAnalystR package to r esear chers who plan to perform large-scale data analysis.
In the future, we aim to support more type of analysis, such as single cell data analysis or casual inference within the context of host genetics (56)(57)(58).

DA T A A V AILABILITY
MicrobiomeAnalyst 2.0 is freely available without registration or login requirements at https: //www.microbiomeanalyst.ca/ .