The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches

Abstract Background Metagenomic studies carried out in the past decade have led to an enhanced understanding of the gut microbiome in human health; however, the Indian gut microbiome has not been well explored. We analyzed the gut microbiome of 110 healthy individuals from two distinct locations (North-Central and Southern) in India using multi-omics approaches, including 16S rRNA gene amplicon sequencing, whole-genome shotgun metagenomic sequencing, and metabolomic profiling of fecal and serum samples. Results The gene catalogue established in this study emphasizes the uniqueness of the Indian gut microbiome in comparison to other populations. The gut microbiome of the cohort from North-Central India, which was primarily consuming a plant-based diet, was found to be associated with Prevotella and also showed an enrichment of branched chain amino acid (BCAA) and lipopolysaccharide biosynthesis pathways. In contrast, the gut microbiome of the cohort from Southern India, which was consuming an omnivorous diet, showed associations with Bacteroides, Ruminococcus, and Faecalibacterium and had an enrichment of short chain fatty acid biosynthesis pathway and BCAA transporters. This corroborated well with the metabolomics results, which showed higher concentration of BCAAs in the serum metabolome of the North-Central cohort and an association with Prevotella. In contrast, the concentration of BCAAs was found to be higher in the fecal metabolome of the Southern-India cohort and showed a positive correlation with the higher abundance of BCAA transporters. Conclusions The study reveals the unique composition of the Indian gut microbiome, establishes the Indian gut microbial gene catalogue, and compares it with the gut microbiome of other populations. The functional associations revealed using metagenomic and metabolomic approaches provide novel insights on the gut-microbe-metabolic axis, which will be useful for future epidemiological and translational researches.

--I commend the authors on their openness and responsiveness to the comments from both reviewers. The additional analyses performed and the clarifications in the manuscript have resulted in a much-improved draft. The availability of both the raw amplicon and WGS data in NCBI's Sequence Read Archive is a great service to the scientific community and should ensure open access to this data and its inclusion in future studies. The origin of the samples and the sequencing data is well-documented and cohort sample collection and storage appears to adhere to ethical and technical standards.
--More specifically, the updated manuscript addresses and corrects all points that I raised in the first review. These include a major reanalysis of 16S data with an updated database (SILVA December 2017 versus GreenGenes 2013) and an implementation of statistical analysis with normalization performed using DESeq2. They clarify their use of downsized data for Unifrac analysis. Further, the authors now combine their data and run downstream analysis using an "Updated-IGC." This clearly aids their analysis and broadens the appeal of the manuscript as a whole. The 9% of additional genes appears to be unique to the Indian cohort. The authors also performed the suggested enterotype comparisons with the data from Arumugam et al.
--Based on this new version of the manuscript, my recommendation is the manuscript can be accepted without further scientific revision. The authors should, nevertheless, have a careful review of the text to address remaining grammatical errors and awkward phrasing. A few, non-exhaustive, examples are given: The changes suggested by reviewers have been highlighted (orange coloured) in the manuscript.
Reply: We thank the reviewer for appreciating our efforts and recommending the manuscript for publication. We have carefully read the manuscript for any grammatical errors and have also made all the suggested changes in the manuscript text.
--Line 114: Change "All the recruited individuals" to "Recruited individuals" Reply: As per reviewer's suggestion, we have made this change (Line: 114).
--Line 365: "its inward transport in microbial cells by the BCAA transporters" would be better as "its uptake by microbes via BCAA transporters" Reply: We agree with the rephrasing and have revised this sentence as suggested (Line: 369).
--Line 408: Change "Though, the sequencing depth in the study was not too high..." to bacterial DNA.
Reply: We thank the reviewer for pointing it out. We have removed the word "bacterial" from this sentence (Line: 589). Metagenomic studies carried out in the past decade have led to an enhanced understanding of the 21 gut microbiome in human health, however, the Indian gut microbiome is not well explored yet. 22 We analysed the gut microbiome of 110 healthy individuals from two distinct locations  Central and Southern) in India using multi-omics approaches, including 16S rRNA gene amplicon 24 sequencing, whole genome shotgun metagenomic sequencing, and metabolomic profiling of faecal 25 and serum samples.

186
This could be attributed to the significant (FDR Adj. P-value = 0.00013) differences in MGS abundance profiles between LOC1 and LOC2 populations as revealed on comparison of their 188 principal coordinates (Additional File 5: Figure S2). 189 Further, the identification of enriched metagenomic species (MGS) from P-values calculated using  assigned with eggNOG annotations. The eggNOG abundance profile generated from relative 216 abundances of genes observed in Indian and other population dataset were ranked based on their 217 mean abundance in descending order. The range of eggNOGs that included 85% of the 1,890 218 essential genes were considered as a part of the core microbial eggNOG set for each population 219 dataset and was used for the analysis. Most of the essential genes were included in the top-ranking 220 clusters suggesting that the essential genes are present in higher abundance than the accessory 221 function genes (Additional File 5: Figure S7).

394
Notably, the BCAA biosynthesis-dependent changes in BCAA levels were largely driven by 395 Prevotella species through threonine-dependent and independent biosynthesis pathways as 396 observed from Delta SCCbg values when genes from this species were removed (see Methods).

397
The correlation network analysis with differential MGS/CAGs revealed threonine-independent 398 isoleucine biosynthesis pathway to be highly correlated with Prevotella copri in LOC1 (Fig. 6C).

481
The study also established the previously unknown faecal metabolome of the Indian population, 482 which showed strong clustering into three metabolomic clusters differentiated by location and diet.

491
The major BCAA 'isoleucine' being produced through a less common threonine-independent 492 pathway for isoleucine biosynthesis, and the higher enrichment of the key enzyme, D-citramalate 493 synthase of the above pathway confirmed its higher abundance in LOC1 as compared to LOC2.

494
Further, this pathway was found to be associated with a single species, Prevotella copri as reported     The quantification of gene content was carried out using the strategy performed by Qin et al.,[7] 607 where the high-quality reads were aligned against the updated IGC using SOAP2 in SOAP aligner 608 version 2.21 with an identity cut off ≥ 90% [67]. Two types of alignments were considered for 609 sequence-based profiling:

610
(1) The entire paired-end read mapped to the gene.

611
(2) One end of paired-end read mapped to a gene and other end outside genic region.

612
In both cases, the mapped read was counted as one copy.

613
The relative abundance of a gene within the sample was calculated as: a = b ∑ b 614 ai: relative abundance of gene in sample S; xi: The times in which gene i was detected in sample S 615 (the number of mapped reads); bi: copy number of gene i in sequenced data from sample S.

650
To identify metagenomic markers using a reference-independent approach on metagenomic 651 samples, a metagenome-wide association study was performed for 340 samples (age and gender 652 matched) including India (both locations), USA, China and Denmark populations. The genes 653 present in at least ≥10% of samples were considered and clustered using the canopy-mgs algorithm 654 as described [73]. The genes having Pearson's correlation coefficient (≥0.9) were clustered into 655 CAGs. Furthermore, the genes for which ≥ 90% abundance was obtained from a single sample 656 were discarded.

716
The oven temperature was increased at 10°C/min to 310°C for 11min and mass data were acquired 717 in full scan mode from m/z 40 to 600 with an acquisition rate of 20 spectra per second.

719
Raw CDF files were used for peak identification and filtering, and the XCMS package in R were 720 used for pre-processing of the peaks. First, the parameters used for pre-processing of the reads 721 were optimized by calculating the reliability index using the formula given below:

722
Reliability index = (number of reliable peaks) 2 /number of unreliable peaks.

723
The reliable peaks were identified for each of the settings such as fwhm, S/N and bw, with a 724 predefined range of values and regression coefficient was calculated for dilutions of QC samples.