Pairing metagenomics and metaproteomics to characterize ecological niches and metabolic essentiality of gut microbiomes

Abstract The genome of a microorganism encodes its potential functions that can be implemented through expressed proteins. It remains elusive how a protein’s selective expression depends on its metabolic essentiality to microbial growth or its ability to claim resources as ecological niches. To reveal a protein’s metabolic or ecological role, we developed a computational pipeline, which pairs metagenomics and metaproteomics data to quantify each protein’s gene-level and protein-level functional redundancy simultaneously. We first illustrated the idea behind the pipeline using simulated data of a consumer-resource model. We then validated it using real data from human and mouse gut microbiome samples. In particular, we analyzed ABC-type transporters and ribosomal proteins, confirming that the metabolic and ecological roles predicted by our pipeline agree well with prior knowledge. Finally, we performed in vitro cultures of a human gut microbiome sample and investigated how oversupplying various sugars involved in ecological niches influences the community structure and protein abundance. The presented results demonstrate the performance of our pipeline in identifying proteins’ metabolic and ecological roles, as well as its potential to help us design nutrient interventions to modulate the human microbiome.


Introduction
Metagenomic sequencing has enabled the measurement of the genomic content and functional potential of microbial communities at an unprecedented rate, aiding the understanding of their role in host health [1][2][3] and biogeochemical cycling [4][5][6].Although various computational approaches based on these genomes quantify interactions within microbial communities [7][8][9][10][11][12] and analyze the functional redundancy (FR) and functional stability of microbial communities [10][11][12], they focus on potential rather than actual function, as microorganisms only express a subset of genes as proteins [13].Recent advancements in highthroughput metaproteomics allow us to quantify protein abundances in human gut microbiomes [14], offering insights into gene expression in response to environmental changes when paired with metagenomic data.
From the metabolic perspective, some genes and their encoded proteins are indispensable for cell metabolism under any conditions, as microbial growth halts without these essential functions-aminoacyl-tRNA synthetase [15,16], ribosomal proteins [17][18][19], and enzymes involved in glycolysis [20,21].From the ecological perspective, gene expression is inf luenced by ecological selection, with specific proteins indicating which resources a microbe can utilize and defining its ecological niche.For instance, Escherichia coli prefers glucose over lactose due to the repressed expression of lactose-utilizing enzymes, even though it can use both sugars [22,23].Such specialization of consuming one resource caused by the selective gene expression may reduce the niche overlap with other species and allow microbial coexistence, as seen with two E. coli strains where one expresses acetylcoenzyme synthetase (Acs) [24,25] to consume acetate produced by the other [26][27][28][29].
Understanding the selective expression of microbial genes is an outstanding question in microbiology.Does the behavior of selective expression of microbial genes differ between metabolic function (e.g.essential for microbial growth metabolism) and ecological function (e.g.claiming resources as a niche)?To answer this, we developed a computational method to analyze paired metagenomic and metaproteomic [14,[30][31][32][33] data, constructing the gene content network (GCN) or protein content network (PCN)-a bipartite graph that connects microbial taxa to their genes or expressed proteins, respectively (Fig. 1A and B).For each gene and its encoding protein, we compared its gene-level (or protein-level) FR, revealing each protein family's metabolic or ecological role.Our method, validated with several gut microbiome data, accurately predicts that ABC-type transporters are related to ecological niches [34][35][36], and ribosomal proteins are essential [17][18][19].Finally, we performed in vitro culture experiments using human gut microbiome samples to investigate how oversupplying sugars involved in ecological niches inf luence community structure and protein expression.

In vitro human gut microbiota culture and metaproteomics
Three healthy individual microbiota samples were collected and biobanked [37].The frozen microbiome samples were cultured in our optimized culture medium [38] with or without the presence of different sugars in technical triplicates, and were taken at different times for optical density and metaproteomic analyses.For single-strain samples, proteins were extracted with 4% Sodium Dodecyl Sulfate (SDS) 8 M urea buffer in 100 mM Tris-HCl buffer, followed by precipitation and acetone washing.Proteins were digested with trypsin desalted [39] for Liquid Chromatography-Tandem Mass Spectrometry (LC-MS/MS) analysis using an Orbitrap Exploris 480 mass spectrometer.For the cultured microbiomes, an automated process extracted and purified proteins, which were then digested, desalted, and quantified using TMT11plex [40], ensuring mixed representation in labeling to avoid bias.Samples underwent a 2-h LC gradient and were analyzed by mass spectrometry.More details can be found in Supplemental Methods.

Datasets
Metagenomics data from four individual microbiomes were obtained from the previous MetaPro-IQ study [14,33] (accessible from the National Center for Biotechnology Information (NCBI) sequence read archive under the accession of SRP068619), and the same samples were reanalyzed by an ultra-deep metaproteomics approach [14] via the PRIDE partner repository [41] with the dataset identifier PXD027297.Proteomics dataset of the cultured singles strain samples has been deposited to ProteomeXchange Consortium with the identifier PXD037923.Metaproteomic dataset of the RapidAIM-cultured microbiome samples has been deposited to ProteomeXchange Consortium (identifier PXD037925).The metaproteomic dataset of the mouse gut microbiome comprising 20 gut microbes is derived from a previous study [42] that was deposited to ProteomeXchange Consortium with the dataset identifier PXD009535 and to MassIVE with the dataset identifier MSV000082287.

Database search and data processing
Proteomics database searches used FASTA databases of the individual strains downloaded from NCBI and MaxQuant [43] 1.6.17.0 for analysis, without the label-free quantification.Metaproteomic database searches of cultured microbiome samples were performed using MetaLab V2.2, and the [44] MaxQuant option was used to search the Tandem Mass Tag (TMT) dataset against the integrated gene catalog (IGC) database of the human gut microbiome.The resulting data table was normalized using R package MSstatsTMT [45], and missing values were imputed using R package DreamAI [46].The "fraction" of each taxon-specific protein is computed by dividing the protein intensity by the sum of the intensities of all proteins assigned to the same taxon.The log2 fold change of each protein is obtained by taking log2 of the ratio between its fraction in the treatment group (with added sugars) and its fraction in the control group (without added sugars).

Statistics
To calculate correlation throughout the study, we used Pearson's correlation coefficient.All statistical tests were performed using standard numerical and scientific computing libraries in the Python programming language (version 3.7.1)and Jupyter Notebook (version 6.1).

Specialist function, niche function, and essential function
Here, we define three types of functions for protein families that we would like to categorize: (i) "Specialist function": specialized by only a few taxa and not widely shared within a community.(ii) "Niche function": arising from ecological competition, widespread among genomes of numerous taxa but selectively expressed under specific ecological conditions.(iii) "Essential function": metabolically indispensable for and widely shared by many taxa within a microbial community.We emphasize that our definition is not exhaustive; some proteins may display attributes of multiple categories or not align precisely with any single category.Using a simple hypothetical example of two competing species (Fig. 1A and B), we demonstrated the three function types: (a) the blue protein is a specialist function since it is solely encoded in the pink species' genome; (b) the red protein belongs to a niche function due to its selective expression by the yellow species even though the protein is encoded in the genomes of both species; (c) the green protein is an essential function because both species need it for biomass synthesis.In coexistence, the pink species specializes in the blue resource, avoiding competition with the yellow species for the red resource.

GCN, PCN, and network degree
We can identify the functional types of proteins in this hypothetical case by comparing the structure of the GCN and PCN (Fig. 1A and B).For example, consider the protein responsible for converting red resource to green metabolite (red broken circle in Fig. 1A and B), its degree in the GCN k GCN = 2, while its degree in the PCN k PCN = 1.This degree reduction is due to distinct ecological niches being occupied by two species when they are cocultured.By contrast, the protein responsible for assimilating critical green metabolites (green broken circle in Fig. 1A and B) into biomass does not show a degree reduction (k GCN = k PCN = 2) because it is essential for microbial growth.Similarly, since the blue protein is only specialized by the pink species, its k GCN = k PCN = 1.Thus, three function types occupy different regions in the k GCN vs. k PCN plot (Fig. 1C).

Quantifying gene-and protein-level FR of each gene and its encoded protein
However, the network degree does not consider the significant impact of the microbial taxonomic profile, which provides details about the makeup of a microbial community.This profile is represented by p = p 1 , . . ., p N , where p i is the relative abundance of taxon-i and N i=1 p i = 1.For a given gene and its encoded protein, we can define its gene-level FR (FR g ) and protein-level FR (FR p ) Figure 1.Protein functions involved in determining ecological niches are postulated to have larger discrepancies between the gene-level functional redundancy FR g and protein-level functional redundancy FR p ; here we use a hypothetical example with three representative proteins (three broken circles with complementary shapes to their substrates) to demonstrate this point; (A) schematic of the genomic capacity of two microbial taxa (oval vs. indented oval); two resources (pentagon and triangle) are externally supplied to the community; the round-shaped metabolite can be transformed from either resource and further utilized in biomass synthesis; the taxon on the left has the capacity of converting either supplied resource into the metabolite, while the taxon on the right can only convert the pentagon-shaped resource; (B) schematic of expressed proteins for two microbial taxa after their competition in the same community; after the competition, the reduced resource conf lict (represented by the taxon on the left choosing the triangle-shaped resource as the sole one to consume) can promote their coexistence; GCN and PCN can be used to capture genomic capacity and expressed protein functions for all taxa; alternatively, this network can be represented as incidence matrices on the bottom (i.e., the presence/absence of edges connecting taxa to proteins); (C and D), the comparison between k GCN and k PCN or between FR g and FR p helps to classify proteins into three protein functional types: specialist function, essential function, and niche function.In the calculation of FR g and FR p , we assume equal abundances of the two species, i.e. p 1 = p 2 = 0.5; (E) the comparison between FR g and FR p when p 1 = 0.9 and p 2 = 0.1; (F) the pipeline of assigning the functions (specialist, niche, or essential) to protein families based on the paired metagenomes and metaproteomes; each individual's gut microbiome sample was subjected to DNA and protein extraction; then a protein-peptide bridge approach can be used for generating the GCN based on the metagenome and PCN based on the metaproteome; when matched metagenomes are available, taxonomic and functional annotations of the metagenomes can be used for PCN generation; based on the generated GCN, PCN, and taxonomic profile, FR g and FR p can be computed and used for the function assignment.within this sample as and ) is the distance between taxon-i and taxon-j based on their genomic capacity to express this gene (or the presence of the protein).For simplicity, we assume d GCN for each protein, different from our previous studies where FR was calculated by including all genes or proteins in a microbial community [12,14].
Comparing FR g and FR p provides deeper insight into proteins' function types.For the red protein in our hypothetical example, d GCN   12   = 0 and d PCN  12 = 1 because both species share the potential to express the gene, while only the yellow species have expressed it (Fig. 1A and B).As a result, FR g = 2 (1 − 0) p 1 p 2 = 2p 1 p 2 and FR p = 2 (1 − 1) p 1 p 2 = 0 (Fig. 1D and E).Following the same analysis, FR g = FR p = 2p 1 p 2 for the green protein, and FR g = FR p = 0 for the blue protein (Fig. 1D and E).Different from compositionindependent k GCN and k PCN , FR g and FR p take the microbial composition into account and thus are more ecologically meaningful.Notably, a more uneven abundance distribution would lead to smaller FR g and FR p (p 1 = p 2 = 0.5 in Fig. 1D; p 1 = 0.9 and p 2 = 0.1 in Fig. 1E).The inf luence of relative abundances on FR can be mitigated by using the normalized FR: nFR = FR / TD, where TD = 1− i p 2 i (Supplementary Fig. 1; see Supplementary Information for the definition).

Overview of our computational pipeline
Following the idea of comparing FR g with FR p , we developed a computational pipeline to assign the function types (specialist, niche, or essential) to protein families based on the paired metagenome and metaproteome (Fig. 1F).This pipeline starts with DNA sequences from metagenomes and peptides sequences from metaproteomes.Using the "protein-peptide bridge" approach that maps peptides to their taxonomic origins and protein families (i.e.orthologous protein clusters), it generates the GCN, PCN, and taxonomic profile, from which we compute FR g and FR p .Details about this approach can be found in the Supplementary Information.Finally, based on the scatterplot of FR g vs. FR p , each protein family is categorized into one of the three function types.Note that the computational pipeline assigns the function type without leveraging the known biological functions.Instead, we validate these assignments against the knowledge about biological functions.

Illustration of our computational pipeline using synthetic data
To illustrate the pipeline's workf low, we utilized synthetic data generated by a consumer-resource model (CRM).Each niche (or specialist) function is modeled as the consumption of a unique and externally supplied resource (Fig. 2A1), whose loss would make a species unable to consume the corresponding resource (Fig. 2A2 and A3).The loss of an essential function is modeled as reducing a species' growth rate by 5% (Fig. 2A4).
For each species, each niche (specialist, or essential) function was assigned to the species' genome with probability p n (p s or p e ), respectively (Fig. 2B, left).We set p n = p e = 0.7 to ensure that we cannot distinguish niche functions from essential functions only based on their k GCN .We set p s = 0.2 < p n = p e so that specialist functions were assigned to fewer species than niche and essential functions.Species' actual expressed functions were determined by randomly sampling a subset of its potential functions (Fig. 2B, middle).This behavior of sub-sampling was observed when we cultured single microbial strains in different environments (Supplementary Fig. 21).We simulated community dynamics until reaching a steady state, for which we constructed the PCN of the surviving species (Fig. 2B, right; see Supplementary information for technical details).More technical details of CRM are in Supplementary Information.
In our model with 10 000 species and 20 functions for each of the three function types, each species randomly sampled a subset of potential functions (Fig. 2C, left) to express (Fig. 2C, middle).We demonstrated a simulation example with 35 species surviving in the final steady state after the community assembly initialized with 10 000 species (Fig. 2C, right).
We applied the taxonomic profile, GCN, and PCN for the surviving 35 species to our computational pipeline, finding that the three modeled protein function types were correctly classified as three clusters (60 out of 60 were correct) by the Gaussian mixture model in both the comparison of network degree (Fig. 2D) and FR (Fig. 2E).We emphasize that the observed three functional clusters arise from community assembly.When we randomly picked 35 species (same as the number of surviving species) from the initial pool with equal abundances without assembly, niche functions cannot be distinguished from essential functions (Fig. 2F and G).Even when assigning the same randomly picked 35 species with the same abundances of surviving species, we still cannot differentiate these two functional types (Supplementary Fig. 2).Our findings held when varying (1) the number of species and functions or (2) model parameters p n , p s , and p e , affirming the method's robustness in distinguishing function types (Supplementary Figs 3-5).

Three protein functional clusters observed in human gut microbiomes
Next, we validated our computational pipeline on real data of human mucosal-luminal interface samples previously collected from the ascending colon of four children [14,33].Here we focused on the genus level and annotated the identified proteins from metagenomic and metaproteomic data via the clusters of orthologous genes (COGs) database [47,48].We chose the genus level due to widely shared peptide sequences across species (Supplementary Fig. 6).We searched metagenomic reads and metaproteomic peptides against the IGC database of the human gut microbiome [49] to generate the GCN and PCN [14] and took the intersected COGs between the two networks.Taxonomic assignment was performed using the "proteinpeptide bridge" method as described previously [14].Our analysis centers on subject HM454, identifying 1542 intersected COGs in both the GCN and PCN and obtaining a taxonomic profile of 85 genera using MetaPhlAn2 [50].The connectance (i.e. the number of edges divided by the maximal number of possible edges) of the GCN (or PCN) is 0.220 (or 0.049), respectively (Fig. 3A and B).The GCN displayed a higher nestedness (nestedness metric NODF [51]=0.667)than the PCN (NODF = 0.453).More details about data processing and NODF are in Supplementary Information.species (ovals and indented ovals) with expressed gene functions selected via the sub-sampling of their genomic capacity; then all species are co-cultured together to simulate their ecological competition; (C) a simulation example of the community assembly, and the construction of GCN and PCN for the survived species; (D and E), the comparison of network degree and FR, respectively, based on the GCN and PCN of survived species in the simulation example in panel-C; a Gaussian mixture model with three clusters is used to identify three protein functional clusters; ellipses around clusters cover areas one standard deviation away from their means; (F-G) the comparison of network degree and FR, respectively, based on the GCN and PCN of 35 species randomly selected from the 10 000 species in the initial pool; all points/functions are colored red (niche functions), green (essential functions), and blue (specialist functions) according to their types of functions in the model; k GCN (or k PCN ) is the network degree of each function in the GCN (or PCN); FR g (or FR p ) is the FR of each function on the gene level (or protein level), respectively.The network degree analysis revealed a general decline from GCN to PCN, with 804 out of 1542 COGs having k PCN < 0.2k GCN (Fig. 3C; Supplementary Data 1).This decline greatly inf luences FR but does not fully explain why many COGs have FR p ∼ 0 (744 out of 1542 have FR p < 0.01 in Fig. 3D) and k PCN nonlinearly correlates with FR p (Fig. 3E).For example, for L-arabinose isomerase (COG2160), its k PCN [7] is fairly close to k GCN [8], but its FR p (0.04) is much lower than FR g (0.23) since the genus Blautia (relative abundance = 22%) did not express L-arabinose isomerase, even if it has this capacity encoded in its genome.
Using the Gaussian mixture model fitted on simulated data, we categorized all protein families into three clusters (Fig. 3C and D).Although clusters on real data are not as distinct as on simulated data, the relative positioning of the three clusters (shaded areas in Fig. 3C and D) agrees well with our hypothesis (Fig. 1).The weaker clustering might result from a greater variation in k GCN (or FR g ) for real data (Fig. 3C and D) than that for simulated data (Fig. 2D and E).Some COGs have FR p > FR g (Fig. 3C and D), contradicting the sub-sampling argument for the gene expression.FR p should not exceed FR g if the PCN was a proper subgraph of the GCN.This contradiction may stem from limitations in metagenomic sequencing and metaproteomic identification depths, as both metagenomics and metaproteomics require sufficient depth to detect genes or proteins, respectively.We tested how the lower detection capability of metaproteomics or metagenomics inf luences the FR by varying the protein or gene abundance percentile threshold (PAPT or GAPT), which denotes the percentage of most abundant proteins or genes being kept.As PAPT or GAPT decreases, FR p or FR g drops respectively (Supplementary Fig. 15).When GAPT decreases, we observed more proteins with FR p greater than their FR g .

Validating three functional clusters observed in human gut microbiomes
Our computational pipeline accurately assigns functional clusters for protein families, agreeing with their known biological functions.For example, COG0539 (ribosomal protein S1) was assigned as the essential function, which is essential for translational initiation [17-19, 52, 53].Another example is the assignment of COG1116 (ABC-type nitrate/sulfonate/bicarbonate transport system) [34] as a niche function, whose expression has been shown to be selectively enriched for a few microbial species [54].
The pipeline's classifications were systematically validated against well-established biological roles of specific protein families: (i) ABC-type transporters are niche proteins due to their connection with ecological metabolic niches [34][35][36]; (ii) ribosomal proteins are essential proteins because they are indispensable for the microbial growth [55,56]; (iii) PTS (phosphotransferase system) proteins are specialist proteins because an evolutionary study has shown that various species within the same genus even possess a different set of PTS proteins [57].To evaluate our pipeline's performance, we quantified its accuracy in assigning these protein families (ABC-type transporters, ribosomal proteins, or PTS proteins) against the assumed "ground-truth" functions (niche, essential, or specialist functions, respectively).
For HM454, our computational pipeline based on the FR g /FR p plot correctly categorizes 81 of 122 COGs belonging to ABC-type transporters, ribosomal proteins, or PTS proteins.In comparison, when classifying functions based on the k GCN /k GCN plot, 74 COGs were correctly assigned, slightly worse than that based on the FR g /FR p plot.Specifically, 26 of 53 COGs belonging to ABC-type transporters are classified as niche functions.The fraction of ribosomal proteins classified to the cluster of essential functions is 83.0%(=44/53).For the PTS proteins, among the identified 16 COGs, 11 are classified as specialist functions.

Alternative clustering and classification methods
Alternatively, we explored the unsupervised K-mean clustering with K = 3, which captured the three representative functional clusters with their positions agreeing with our expectations (Supplementary Fig. 13).We also designed a supervised classifier based on quadratic discriminant analysis (QDA).QDA, trained on ABC-type transporters, PTS proteins, and ribosomal proteins as the niche, specialist, and essential functions, generated clusters closely resembling those from the Gaussian mixture model (Supplementary Fig. 14).For HM454, the K-mean clustering categorizes 48 of these 122 COGs that are ABC-type transporters, ribosomal proteins, or PTS proteins into clusters, respectively, representing niche, essential, or specialist functions (i.e. the accuracy is 39.3%).For the QDA classifier, the accuracy is 59.0% (=72/122).Thus, we select the Gaussian mixture model as the classification method because of its superior accuracy (66.4% = 81/122).

Comparing FR g with FR p identifies ecological niches and metabolic essentiality
We focused on analyzing ABC-type transporters [34][35][36] and ribosomal proteins [17][18][19].ABC-type transporters are energyrequiring transporter proteins that allow microbes to exploit specific niches like glucose uptake [34][35][36].For HM454, we indeed found that k GCN for all ABC-type transporters is much larger than their k PCN (Fig. 4A).Similarly, we also found that their FR g values are much larger than their FR p values, classifying many transporter proteins as niche functions (Fig. 4B).Some transporter proteins were classified as specialist functions (blue dots in Fig. 4B) due to the specialization on the gene level, which is carried to the protein level.Some transporter proteins were classified as essential functions (green dots in Fig. 4B).One example is the ABC-type Fe 3+ /spermidine/putrescine transporter (COG3842), as iron is essential for bacteria to function as a co-factor in ironcontaining proteins [58,59].
Ribosomal proteins, critical for protein synthesis and microbial growth [55,56], showed little variance between k GCN and k PCN (the mean and the standard deviation of the relative difference kGCN−kPCN kGCN is 0.09 ± 0.41; Fig. 4E), with most correctly classified as essential (44 out 53 COGs in Fig. 4F).Notably, two ribosomal proteins (L28 and L34), which have been reported as non-essential to microbes such as E. coli [17,53,60], were accurately classified as non-essential proteins (red dots in Fig. 4E).Certain specialized ribosomal proteins in microbial genomes continue to be specialized on the protein level and thus were classified as specialist functions.
Alternatively, we looked at the distribution of network degrees (Fig. 4C and G) and FR (Fig. 4D and H).For ABC-type transporters, the distribution of k PCN is close to 0 (median of 2), while the median of k GCN is 25.For ribosomal proteins, the distribution of k PCN (median is 12) is similar to k GCN (median is 14).For ABCtype transporters, the distribution of FR p is close to 0 (with a median ∼ 0.01), while the median of FR g is around 0.30.For ribosomal proteins, the distribution of FR p (median ∼ 0.20) is similar to the distribution of FR g (median ∼ 0.21).
The similar patterns are also true for the other three individuals (Supplementary Figs 8-10; Supplementary Data 2-4).Variations in the FR g /FR p plot across individuals (Supplementary Figs 8-10) are likely due to differences in gut environments, diets, and microbial composition.Note that HM503 is an outlier due to its lower diversity (36 genera versus the average of 56.7) (Supplementary Fig. 11).Similarly, the Shannon diversity index for HM503 is only 1.41, lower than other individuals (mean and standard deviation are 2.05 and 0.18).The lower diversity in HM503 leads to fewer taxa owning the same function on average, resulting in lower FR g and FR p values.
Extending our analysis to additional protein families, we discovered that proteins linked to glycolysis, RNA polymerase, and the Tricarboxylic Acid cycle (TCA) cycle displayed patterns similar to ribosomal proteins (Fig. 4I and J).By contrast, proteins associated with sugar utilization, glycosyl hydrolase, and aromatic amino acid biosynthesis exhibited patterns more akin to ABCtype transporters.We also analyzed the classification results for the unknown COG functions (with the category "S: Function Unknown").Out of the 104 unknown COGs in our analysis, the majority are classified as specialist (43/104) or niche functions (40/104), with a smaller portion identified as essential functions (21/104).
We also confirmed our results using the KEGG Orthology (KO) annotation [61][62][63][64], which has a lower annotation rate (78%) than COG (92%).For HM454, our computational pipeline categorizes 75 of these 126 KOs that are ABC-type transporters, ribosomal proteins, or PTS proteins into clusters, respectively, representing niche, essential, or specialist functions, similar to the classification accuracy based on the COG.The contrasting difference between ABC-type transporters and ribosomal proteins is well preserved (see Supplementary Fig. 7).Additionally, the distribution of FR p shows a dramatic difference across KO groups (Supplementary Fig. 12).Some ecologically strongly selected KO groups such as ABC transporters have small FR p (Supplementary Fig. 12).As a comparison, proteins from aminoacyl-tRNA biosynthesis [15,16], glycolysis [20,21], and ribosomes [17][18][19] have large FR p and huge variations within each group (Supplementary Fig. 12).

Validating our method on the mouse gut microbiome
In testing our method's feasibility in other microbial communities, we leveraged a metaproteomic dataset from mice gavaged with a synthetic microbiome comprising 20 sequenced bacteria [42].Since this study lacks paired metagenomes, we used its 16S rRNA gene sequencing data with individual genomes to infer Comparison of network degree and FR between the gene and protein level for many protein families including ABC-type transporters and ribosomal proteins from the human gut microbiome; (A) network degrees in GCN are larger than network degrees in PCN for most ABC-type transporter COGs; k GCN (or k PCN ) is the network degree of each COG in the GCN (or PCN); (B) FR g is larger than FR p for most ABC-type transporter COGs; (C and D) the distribution of network degrees and functional redundancies (violin plots and boxplots) for ABC-type transporter COGs shows a significantly huge reduction from k GCN to k PCN or from FR g to FR p ; (E) network degrees in GCN are comparable with that in PCN for most ribosomal protein COGs; (F) FR g is comparable with FR p for most ribosomal protein COGs; points in scatter plots are colored by the same colors used in Fig. 3d;  (G and H) the distribution of network degrees and functional redundancies (violin plots and boxplots) for ribosomal protein COGs shows no significant reduction from k GCN to k PCN or from FR g to FR p ; (I) the fraction of assigned specialist, niche, or essential functions based on comparing network degrees k GCN and k PCN for many protein families; (J) the fraction of assigned specialist, niche, or essential functions based on comparing functional redundancies FR g and FR p for many protein families; in all boxplots, the middle white dot is the median, the lower and upper hinges correspond to the first and third quartiles, and the black line ranges from the 1.5 × IQR (where IQR is the interquartile range) below the lower hinge to 1.5 × IQR above the upper hinge; all violin plots are smoothed by a kernel density estimator and 0 is set as the lower bound; all statistical analyses were performed using the two-sided Mann-Whitney-Wilcoxon U test with Bonferroni correction between genomic capacity (GCN) and protein functions (PCN); P-values obtained from the test is divided into five groups: (1) P > .05(ns), (2) .01< P ≤ 0.05 ( * ), (3) 10 −3 < P ≤ .01 ( * * ), (4) 10 −4 < P ≤ 10 −3 ( * * * ), and ( 5 E-H) comparison between k GCN and k PCN , comparison between FR g and FR p , the distribution of network degrees, and the distribution of functional redundancies for ABC-type transporter COGs; (I-L) comparison between k GCN and k PCN , comparison between FR g and FR p , the distribution of network degrees, and the distribution of functional redundancies for ribosomal protein transporter COGs; (M) the fraction of assigned specialist, niche, or essential functions based on comparing network degrees k GCN and k PCN for many protein families; (N) the fraction of assigned specialist, niche, or essential functions based on comparing functional redundancies FR g and FR p for many protein families; in all boxplots, the middle white dot is the median, the lower and upper hinges correspond to the first and third quartiles, and the black line ranges from the 1.5 × IQR (where IQR is the interquartile range) below the lower hinge to 1.5 × IQR above the upper hinge; all violin plots are smoothed by a kernel density estimator and 0 is set as the lower bound; all statistical analyses were performed using the two-sided Mann-Whitney-Wilcoxon U test with Bonferroni correction between genomic capacity (GCN) and protein functions (PCN); P-values obtained from the test are divided into five groups: (1) P > .05(ns), (2) .01< P ≤ .05( * ), (3) 10 −3 < P ≤ .01 ( * * ), (4) 10 −4 < P ≤ 10 −3 ( * * * ), and (5) P ≤ 10 −4 ( * * * * ).its metagenome.Here we focused on the strain level because peptides of different strains in this simple synthetic gut microbiome can be distinguished.We relied on the comparison between FR g and FR p to generate the distribution of functional clusters across many protein families for this dataset (Fig. 5N).The results mirrored those of human gut microbiomes, especially the contrasting patterns between ABC-type transporters and ribosomal proteins (Fig. 5).

Response of community and protein abundance to the introduction of sugars
After identifying niche functions through our computational pipeline, we explored using nutrients associated with niche functions to manipulate the community structure.In ecology, a niche is often defined as an abiotic and biotic factor that supports the survival of species [9,[65][66][67].Therefore, niche functions are associated with corresponding limiting resources involved in those functions.For example, COG1879 (ABC-type sugar transporter) is categorized as a niche function due to microbial competition for sugars (Supplementary Fig. 17).Here, we leveraged the in vitro community and studied how expression levels of ATP-type transporters respond to supplied sugars so that a microbial taxon can achieve a better living strategy.
Using the RapidAIM V2.0 approach [68], which replicates the functional profiles of individual gut microbiomes in vitro [38], we cultured three individual human gut microbiota samples and used a semi-automated metaproteomics workf low to observe how taxon-specific proteins respond to the presence of glucose, fructose, and kestose (Fig. 6A).Samples were cultured in technical triplicates, and protein abundances were quantified at 0, 1, 5, 12, and 24 h using 11-plex tandem mass tag (TMT11plex) [40] for a total of 189 samples.We analyzed the Bray-Curtis dissimilarity of metaproteomes over time, finding that more complex sugars induce more pronounced alterations in protein profiles (Supplementary Fig. 16).
To ref lect the effect of introduced sugars on protein abundances, we used log2 of fold change in normalized protein abundances/intensities (see Supplemental Methods for details) between the treatment and control group (Fig. 6).We hypothesized that excessive sugars remove the growth limitation on carbon resources, prompting microbes to upregulate other transporters for uptaking more other scarce resources (e.g.nitrogen or amino acids) for better growth (Fig. 6A).We analyzed log2 fold changes of ABC-type transporters 5 h later after sugar introduction (Fig. 6B-D), with most COGs close to zero.Among seven significantly inf luenced COGs, COG1126 (ABC-type polar amino acid transport system) is the only one that is revealed to be a niche function.Focusing on COG1126, we found that it is specialized by the genus Holdemanella.Holdemanella benefits from upregulating COG1126, as the proportion of Holdemanella proteins significantly increases from 13.5%(± 0.06%) for the control to 15.8%(± 0.08%) with the added glucose (P-value = .04,Mann-Whitney U test applied).
We observed that adding fructose, glucose and fructose, or kestose alters ABC-type transporters' expression similarly to glucose alone (Fig. 6).The correlation in log2 fold changes of ABCtype transporters between different added sugars is significant (Supplementary Fig. 18).Notably, complex sugars trigger more significant fold changes in microbial protein expression.This pattern persists for metaproteomic measurements 12 and 24 h later, while the fold changes 1 hour later are less significant (Supplementary Fig. 19; P value <.01 for four sugar-adding scenarios, Mann-Whitney U test applied).Additionally, the overwhelmingly positive log2 fold changes of ribosomal proteins (Supplementary Fig. 20; P-value <10 −4 for four sugar-adding scenarios, one-sample Wilcoxon test applied) probably imply faster microbial growth when simple sugars are supplied [69,70].

Discussion
We developed a computational pipeline to classify protein families as specialist, niche, and essential functions by comparing FR g with FR p .This approach supplements traditional methods that test metabolic essentiality by gene knockout [17][18][19] and identify limiting resources by measuring biomass changes upon resource supplies [71][72][73][74].We first illustrated this method on synthetic data and then validated it using real datasets of human and mouse gut microbiomes.We acknowledge a limitation in our validation process-the reliance on limited available literature.Hence, our classification should be seen as a preliminary framework that is open to refinement as the investigation of protein families' functions improves.
Our findings bridge the gap between the ecological niche theory, which posits that each resource (or niche) can only be occupied by one species for steady-state conditions [67,75,76], and the FR revealed by shared functions among microbial genomes [11,12].We solved this dilemma by showing niche proteins usually have very small FR p and large FR g .Additionally, our ecological framework combines genomic capacity and protein functions together by introducing species with subsampled functions.The model framework accounts for selective expression due to different environmental conditions [77][78][79] or evolved strains with distinct metabolic niches [26,27,80], reconciling phenotype-focused ecological models with genetic data.
The observed case of FR p > FR g for some COGs could stem from using MetaProIQ, a general microbiome catalog, for metaproteome analysis.Although MetaProIQ facilitates the identification of proteins from various gut microbes, the general search against it may cause anomalies.In contrast, directly searching metaproteomic data against the gene calls from the paired metagenome may deliver more accurate identifications.However, this strategy suffers when the metagenomic sequencing is incomplete, leading to undetected proteins due to missing genes.In our human gut microbiome datasets, limited sequencing depth might cause such issues.For the synthetic mouse gut communities with complete genomes for each microbial strain, we matched metaproteome to microbial genomes and did not encounter any case of FR p > FR g , highlighting the effectiveness of this approach in contexts with complete genomic data.
In this work, we only validated our pipeline on gut-related biomes due to the limited accessibility of paired metagenomemetaproteome in other environments.In nutrient-poor environments, where ribosomal proteins are less expressed [81,82], essential proteins like ribosomes may be harder to detect, causing potential detection biases.This aspect warrants further investigation to ensure the robustness and applicability of our method in diverse ecological settings.Other technical limitations can also impact clustering accuracy.Smaller ribosomal proteins L28 and L34 could be detected less frequently in metaproteomics, and post-translational modifications may also result in missed cleavage and identification of peptides.Advances such as metaproteomics-assembled proteomes [83] may improve taxonspecific functional annotations and the accuracy of our clustering outcomes.

Figure 6.
Microbes modify their expression for ABC-type transporters to adapt to added sugars; all heatmaps share the same color bar on the right; (A) schematic of in vitro cultures of a collected human gut microbiome; in the treatment group, one sugar is added to the community; metaproteomic measurements 5 h later were used to compare the intensity of each taxon-specific protein using the log2 fold change of each protein's fraction (i.e.normalized intensity over each genus) from the treatment group divided by that from the control group; Log2 fold changes of ABC-type transporters were computed 5 h after (B) glucose, (C) fructose, (D) kestose, or (E) glucose and fructose is added; the transported metabolites for each COG are added to the brackets.
ij is binary, i.e. d GCN ij = 0 if and only if both taxa share the potential to express the gene, and d GCN ij = 1 otherwise.d PCN ij = 0 if and only if both taxa have expressed the protein.Here, we define FR g and FR p

Figure 2 .
Figure 2. Three protein functional clusters (specialist function, essential function, and niche function) considered in the community assembly model form three distinct clusters when the network degree and FR are compared between the GCN and PCN in model-generated synthetic data; (A1-A4) three types of functions modeled have different ecological and metabolic roles; the niche function (the protein missing in A2) and specialist function (the protein missing in A3) are modeled as abilities to consume externally supplied resources; the role of essential functions (the protein missing in A4) is considered as a reduction in the overall growth rate for each missing essential function; (B) a schematic diagram of the community assembly; species (ovals and indented ovals) with expressed gene functions selected via the sub-sampling of their genomic capacity; then all species are co-cultured together to simulate their ecological competition; (C) a simulation example of the community assembly, and the construction of GCN and PCN for the survived species; (D and E), the comparison of network degree and FR, respectively, based on the GCN and PCN of survived species in the simulation example in panel-C; a Gaussian mixture model with three clusters is used to identify three protein functional clusters; ellipses around clusters cover areas one standard deviation away from their means; (F-G) the comparison of network degree and FR, respectively, based on the GCN and PCN of 35 species randomly selected from the 10 000 species in the initial pool; all points/functions are colored red (niche functions), green (essential functions), and blue (specialist functions) according to their types of functions in the model; k GCN (or k PCN ) is the network degree of each function in the GCN (or PCN); FR g (or FR p ) is the FR of each function on the gene level (or protein level), respectively.

Figure 3 .
Figure 3. Real data of the human gut microbiome showing three clusters on the plot that compares FR g with FR p ; metagenome and metaproteome of subject HM454 mucosal-luminal interface samples[33] were used to construct GCN and PCN, respectively; (A) the GCN shows if a genus owns (or doesn't own) a COG as its genomic capacity, which is filled (or empty); the GCN matrix is ordered to have decreasing network degrees for both genera and COGs; (B) the PCN shows if a genus expresses (or doesn't express) a COG as its protein function, which is filled (or empty); the PCN matrix follows the same order as the GCN; (C) differences in network degree for most COGs are large; k GCN is the network degree of each COG in the GCN (i.e. the number of genera owning each COG in the GCN); k PCN is the network degree of each COG in the PCN (i.e. the number of genera owning each COG in the PCN); (D) FR g is larger than FR p for most COGs; three functional clusters are predicted by the Gaussian mixture model with three clusters fitted on synthetic data; the transparent large circles represent centroids of three clusters; (E) the relationship between FR p and network degree of PCN for COGs is not monotonic.

Figure 4 .
Figure 4.Comparison of network degree and FR between the gene and protein level for many protein families including ABC-type transporters and ribosomal proteins from the human gut microbiome; (A) network degrees in GCN are larger than network degrees in PCN for most ABC-type transporter COGs; k GCN (or k PCN ) is the network degree of each COG in the GCN (or PCN); (B) FR g is larger than FR p for most ABC-type transporter COGs; (C and D) the distribution of network degrees and functional redundancies (violin plots and boxplots) for ABC-type transporter COGs shows a significantly huge reduction from k GCN to k PCN or from FR g to FR p ; (E) network degrees in GCN are comparable with that in PCN for most ribosomal protein COGs; (F) FR g is comparable with FR p for most ribosomal protein COGs; points in scatter plots are colored by the same colors used in Fig.3d; (G and H) the distribution of network degrees and functional redundancies (violin plots and boxplots) for ribosomal protein COGs shows no significant reduction from k GCN to k PCN or from FR g to FR p ; (I) the fraction of assigned specialist, niche, or essential functions based on comparing network degrees k GCN and k PCN for many protein families; (J) the fraction of assigned specialist, niche, or essential functions based on comparing functional redundancies FR g and FR p for many protein families; in all boxplots, the middle white dot is the median, the lower and upper hinges correspond to the first and third quartiles, and the black line ranges from the 1.5 × IQR (where IQR is the interquartile range) below the lower hinge to 1.5 × IQR above the upper hinge; all violin plots are smoothed by a kernel density estimator and 0 is set as the lower bound; all statistical analyses were performed using the two-sided Mann-Whitney-Wilcoxon U test with Bonferroni correction between genomic capacity (GCN) and protein functions (PCN); P-values obtained from the test is divided into five groups: (1) P > .05(ns),(2) .01< P ≤ 0.05 ( * ), (3) 10 −3 < P ≤ .01 ( * * ), (4) 10 −4 < P ≤ 10 −3 ( * * * ), and (5) P ≤ 10 −4 ( * * * * ); network degree comparison of ABC transporters: P = 7.11 × 10 −16 ; network degree comparison of ribosomal proteins: proteins: P = .10;redundancy comparison of ABC transporters: P = 2.19 × 10 −11 ; redundancy comparison of ribosomal proteins: P = 1.00.
Figure 4.Comparison of network degree and FR between the gene and protein level for many protein families including ABC-type transporters and ribosomal proteins from the human gut microbiome; (A) network degrees in GCN are larger than network degrees in PCN for most ABC-type transporter COGs; k GCN (or k PCN ) is the network degree of each COG in the GCN (or PCN); (B) FR g is larger than FR p for most ABC-type transporter COGs; (C and D) the distribution of network degrees and functional redundancies (violin plots and boxplots) for ABC-type transporter COGs shows a significantly huge reduction from k GCN to k PCN or from FR g to FR p ; (E) network degrees in GCN are comparable with that in PCN for most ribosomal protein COGs; (F) FR g is comparable with FR p for most ribosomal protein COGs; points in scatter plots are colored by the same colors used in Fig.3d; (G and H) the distribution of network degrees and functional redundancies (violin plots and boxplots) for ribosomal protein COGs shows no significant reduction from k GCN to k PCN or from FR g to FR p ; (I) the fraction of assigned specialist, niche, or essential functions based on comparing network degrees k GCN and k PCN for many protein families; (J) the fraction of assigned specialist, niche, or essential functions based on comparing functional redundancies FR g and FR p for many protein families; in all boxplots, the middle white dot is the median, the lower and upper hinges correspond to the first and third quartiles, and the black line ranges from the 1.5 × IQR (where IQR is the interquartile range) below the lower hinge to 1.5 × IQR above the upper hinge; all violin plots are smoothed by a kernel density estimator and 0 is set as the lower bound; all statistical analyses were performed using the two-sided Mann-Whitney-Wilcoxon U test with Bonferroni correction between genomic capacity (GCN) and protein functions (PCN); P-values obtained from the test is divided into five groups: (1) P > .05(ns),(2) .01< P ≤ 0.05 ( * ), (3) 10 −3 < P ≤ .01 ( * * ), (4) 10 −4 < P ≤ 10 −3 ( * * * ), and (5) P ≤ 10 −4 ( * * * * ); network degree comparison of ABC transporters: P = 7.11 × 10 −16 ; network degree comparison of ribosomal proteins: proteins: P = .10;redundancy comparison of ABC transporters: P = 2.19 × 10 −11 ; redundancy comparison of ribosomal proteins: P = 1.00.

Figure 5 .
Figure 5.Comparison of network degree and FR between the gene and protein level for many protein families including ABC-type transporters and ribosomal proteins from the synthetic mouse gut microbial community; (A) comparison between network degrees of all COGs in the GCN (k GCN ) and network degrees of all COGs in the PCN (k PCN ); (B) comparison between functional redundancies of all COGs in the GCN (FR g ) and functional redundancies of all COGs in the PCN (FR p ); (C and D) the distribution of network degrees and functional redundancies (violin plots and boxplots) for all COGs; (E-H) comparison between k GCN and k PCN , comparison between FR g and FR p , the distribution of network degrees, and the distribution of functional redundancies for ABC-type transporter COGs; (I-L) comparison between k GCN and k PCN , comparison between FR g and FR p , the distribution of network degrees, and the distribution of functional redundancies for ribosomal protein transporter COGs; (M) the fraction of assigned specialist, niche, or essential functions based on comparing network degrees k GCN and k PCN for many protein families; (N) the fraction of assigned specialist, niche, or essential functions based on comparing functional redundancies FR g and FR p for many protein families; in all boxplots, the middle white dot is the median, the lower and upper hinges correspond to the first and third quartiles, and the black line ranges from the 1.5 × IQR (where IQR is the interquartile range) below the lower hinge to 1.5 × IQR above the upper hinge; all violin plots are smoothed by a kernel density estimator and 0 is set as the lower bound; all statistical analyses were performed using the two-sided Mann-Whitney-Wilcoxon U test with Bonferroni correction between genomic capacity (GCN) and protein functions (PCN); P-values obtained from the test are divided into five groups: (1) P > .05(ns), (2) .01< P ≤ .05( * ), (3) 10 −3 < P ≤ .01 ( * * ), (4) 10 −4 < P ≤ 10 −3 ( * * * ), and (5) P ≤ 10 −4 ( * * * * ).