A catalog of microbial genes from the bovine rumen reveals the determinants of herbivory

The rumen microbiota is a unique feature of ruminants and thorough knowledge of the genetic potential of rumen symbiotic microbes provides opportunities for improving the sustainability of ruminant production systems. Using deep metagenome sequencing we identified 13,825,880 non-redundant prokaryote genes from the bovine rumen and constructed 324 high quality metagenomic species. These metagenomic species were prevalent in the rumen of 77 cattle fed different feeds whereas previously identified rumen microbial genomes were less abundant. Compared to human, pig and mouse gut metagenome catalogs, the rumen is richer in functions and microbial species associated to the degradation of plant cell wall material and production of methane. Genes coding for enzymes that deconstruct plant polysaccharides showed a particularly high richness that is otherwise impossible to infer from available genomes or shallow metagenomics sequencing. These data bring new insights on functions, enzymes and microbes of the rumen, critical to understand phenotypes and biological processes.


INTRODUCTION
Ruminant production contributes to livelihood and to food and nutritional security in many 52 regions of the world. Milk and meat from ruminants are important sources of protein and micronutrients in the human diet but often criticized as unsustainable because of the low 54 conversion efficiency of plant feeds into animal foods (Aiking 2014) and also due to their high environmental footprint. However, when the feed conversion efficiency of protein and 56 energy contained in milk and meat is calculated based on the ingestion of human-inedible protein and energy the output is higher that the input, particularly in forage-based production 58 systems (Gill et al. 2010;Ertl et al. 2016). The transformation of feeds, not suitable for human consumption, into highly nutritious protein and energy products is carried out by 60 gastrointestinal symbiotic microbes, particularly those residing in ruminants' forestomach -4 the rumen. Rumen microbes are essential for ruminants allowing them to thrive in agricultural 62 land not suitable for crops and to consume agricultural byproducts unfit for other livestock species. The enhanced functions provided by the microbiome are key for the characteristic 64 adaptability and robustness of ruminants to cope with nutritional and climatic stresses (Morgavi et al. 2013). Improving our understanding of the rumen microbiome provide 66 opportunities for knowledge-based strategies aiming at enhancing efficacy in ruminant production while minimizing its negative effect on the environment. In this study, we 68 established a catalog of rumen prokaryotes genes that enabled us to decipher functions of the microbiome, in particular the capacity to deconstruct structural carbohydrates from forages, 70 and we explored the effect of feed on the microbiome composition and functions.

Constitution of the bovine rumen prokaryotic gene catalog
To build a bovine rumen prokaryotic gene catalog, we collected total rumen content samples 74 from five Holstein cows and five Charolais bulls. To reduce the ecosystem complexity and to improve metagenome assemblies, rumen ciliated protozoa were depleted from the samples 76 before microbial DNA extraction. A total of 1,206 Gb of raw metagenomic sequencing data were generated with an average 111 Gb clean data for each animal. This sequencing depth, 78 much greater than that used for other gene catalogs (5-10 Gb per sample (Li et al. 2014;Xiao et al. 2015;Xiao et al. 2016)) was necessary to enable the assembly of the more complex 80 rumen microbiome. After de novo assembly, open reading frames (ORF) prediction and removal of redundancy, 13,825,880 non-redundant genes were obtained with an average 82 length of 716 base pairs (bp) and 39% of these genes were complete ORFs (Supplementary Table 1). 84 5 Compared to the largest rumen gene catalog published to date by Hess et al. (2011), the number of non-redundant genes discovered in this study is more than 5 fold larger; shared 86 genes were in most cases also longer (Figure 1a & b and Supplementary Table 2). Thus, the mapping reads from 77 additional rumen samples obtained in this study and eight published 88 rumen samples from UK cattle (Wallace et al. 2015) increased from ~10% using the previous catalog (Hess et al. 2011) to ~40% (11-51%) (Figure 1 c & d). This confirms that the 90 c d Figure 1. (a) Identity of current study genes compared to Hess et al. (2011). (b) Differences in gene length between studies. Green, the length of genes is longer in the current study; Blue, the length of genes is similar; Red, the length of genes is shorter in the current study. (c) Percentage of total reads in the current study (n=77 samples) in four different diet groups that could be mapped to current study gene catalog and JGI 2011 gene catalog. (d) Percentage of total reads in unrelated studies of 8 UK cattle samples that could be mapped to the current study gene catalog and JGI 2011 gene catalog.
6 representativeness of the rumen catalog has been greatly improved, even if the mapping efficiency was relatively low, as compared to 80% for the human gut microbiome (Li et al. 92 2014).

Comparison of gastrointestinal microbiomes: bovine rumen versus Human, pig and 94
mouse Genes were taxonomically classified using CARMA3 (Gerlach and Stoye 2011) and 96 compared to genes from human, mouse and pig gut catalogs (Li et al. 2014;Xiao et al. 2015;Xiao et al. 2016). Up to 42.7% of rumen genes could be annotated to known phyla. This 98 value is similar to pig gut (41.3%) but lower than the human (55.9%) and mouse gut metagenomes (59.6%) (Supplementary Figure 1

and Supplementary Table 3). Firmicutes and 100
Bacteroidetes were predominant in all catalogs representing 84-94% of assigned genes and in accord with the expected gastrointestinal-associated microbial communities in mammals. For 102 the rumen, however, the proportion of Firmicutes and that of Bacteroidetes was lower and higher, respectively, than for the other three catalogs. Other enriched phyla (>2%) in the 104 rumen catalog were the Spirochaetes, Proteobacteria, Euryarchaeota, Actinobacteria and Fibrobacteres that, with the exception of Proteobacteria, were more abundant in the rumen 106 than in the other catalogs (Supplementary Figure 2). At the genus level, only 8.7% of rumen genes could be annotated; a value similar to that of the other two animal catalogs but lower 108 than that of human (16.8%), reflecting a more extensive characterization of human-associated microbes. However, the top 10 enriched genera in the rumen showed distinct abundance 110 patterns compared with the same genera in other catalogs (Supplementary Figure 3 and   Supplementary Table 3 & 4). These differences in symbiotic microbial genera likely reflect 112 dissimilarities in dietary lifestyles, anatomical localization of the gut fermentation compartment and are indicative of predominant functions, i.e. methane production and plant 114 fiber degradation for ruminants. Prevotella was the most abundant rumen genus with 39% of 7 genus-annotated genes assigned. Other abundant genera were Treponema, Butirivibrio,116 Methanobrevibacter and Ruminococcus that were absent or at lower proportions in other catalogs, particularly in human. 118 Based on abundance profiles and clustering methods we identified 324 metagenomic species (MGS), with an average size of 1.8 Mbp (minimum threshold of 1 Mbp; see Methods 120 for more information on these MGSs). More than half (173) were medium-quality and 23 were high-quality drafts (>90% completion, <5% contamination) (Bowers et al. 2017). These 122 rumen MGSs were also compared to pig and mouse MGSs. For the rumen, 314 were annotated to the Bacteria domain and one to Archaea. For the bacterial MGSs, 44.4% could 124 be annotated to the order level but only 3.7% (12 MGSs) to the genus level; 10 of these belonged to Prevotella. The 324 MGSs were present in all four diets groups from our 126 validation cohort; about 10% of reads from the 77 cattle dataset mapped to these MGSs. For genomes from the Hungate 1000 project (Kelly), which are representative of the diversity of 128 cultured rumen bacteria and archaea, mapping rate was 5.4%. In contrast, only 0.1% of reads mapped to the metagenomic species described by Hess et al. (2011) (Supplementary Figure 130 4 pectins and hemicelluloses (GH43, GH28, GH10, GH51, GH9 and GH78, by decreasing abundance). In contrast, only one of the 15 most abundant families, namely GH25 150 lysozymes, targets a non-plant substrate (peptidoglycan). Additionally, three of the five most abundant families (GH3, GH2 and GH5) represent enzymes active on a wide range of 152 substrates, not necessarily from plant origin. Two of these families (GH2 and GH3) contain exo-glycosidases that act on the oligosaccharides produced by depolymerases, a broad 154 function that may explain their abundance.
Dockerins domains (DOCs) are key building blocks of cellulosomes and amylosomes 156 complexes. The DOC sequences are found in modular proteins and help the protein to which they are appended to bind cohesin domains (COHs) found as repeats in large proteins named 158 scaffoldins. This system allows the spatial grouping of numerous binding and enzymatic modules into large assemblies for a synergistic action of their components in the immediate 160 vicinity of the bacterial cell. In the rumen catalog, more than 12,000 dockerin modules were identified. Intriguingly, some proteins harbored many dockerin modules, up to 13 modules in 162 a single sequence, without any other recognizable functional module. The function of such 9 polydockerin proteins is unknown, and polydockerin proteins were not observed in 164 reconstructed MGS (max. of two DOCs in a protein). In the literature, dockerin modules initially detected in cellulosomes, have been investigated in relation to their co-occurrence 166 with CAZymes in these cellulosome complexes (Brulc et al. 2009;Turnbaugh et al. 2010).
Surprisingly, our analysis of the rumen catalog reveals that only ~24% of the DOC-containing 168 proteins carry a CAZyme domain. The remaining DOC-containing proteins were subjected to a Pfam domain annotation, which identified proteases (4%) and some lipases (<0.3%), while a 170 third of DOC-containing proteins are attached to non-catalytic modules, likely involved in the binding of these non-carbohydrate substrates. More importantly, the last third did not have 172 any match to any Pfam domain (Supplementary Figure 5).
The CAZyme profile in the rumen catalog was compared to the mouse, pig and human 174 reference gut catalogs (Li et al. 2014;Xiao et al. 2015;Xiao et al. 2016). Despite important differences in the size of these catalogs, similar trends could be observed on, for example, the 176 ratio of DOCs or GHs plus PLs over the catalog size, or the most abundant GH families (Supplementary Table 6). The number of distinct GHs/PLs is also very similar, and a detailed 178 analysis highlighted 101 GHs families common to all four catalogs, while only five GH families were specific to a single catalog (Supplementary Figure 6). These specific families 180 were closely related to hosts' diets. In accord with herbivory, 305 GH45 cellulase modules were found in the rumen catalog against none in human, mouse catalogs, and only 12 for the 182 pig. In contrast, we identified families GH70 and GH68, transglycosidases acting on sucrose, and GH47, processing N-glycan, that are absent in the rumen but present in other catalogs. 184 For instance, 94, 24 and 6 GH70 modules were found in the human, pig and mouse catalogs, respectively, whereas the rumen had zero occurrence. 186 The specific adaptation of the rumen catalog to herbivory was confirmed by comparing its GH+PL family counts against the human catalog after normalization ( Figure 2). The most 188 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint 10 enriched GH families in the rumen are involved in the degradation of plant polysaccharides while the more depleted families of GHs are those degrading animal (host) glycans. These 190 observations are not only in accord with cattle normal diet but they are also in agreement with the absence of a glycoprotein-rich mucus lining the rumen as opposed to the lower 192 gastrointestinal tract. Finally, we also observed that multiple DOC module duplications seem to be more frequent and intense in the rumen as up to 13 DOC repeats in a single protein were 194 found for the rumen catalog, compared to only six in the human, four in the pig and two in the mouse catalogs.  The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint 11 generated MGS were thus determined and subjected to a hierarchical clustering analysis (Supplementary Figure 7) that revealed that MGSs group together roughly to their predicted 202 taxonomy, even despite large differences in repertoire size within each phylum. Hereafter, we detailed several strategies for carbohydrate foraging that evolved in the different bacterial 204 phyla. Among the predicted Firmicutes, MGSs encoding cellulosomes and amylosomes displayed a readily recognizable profile characterized by the presence of several DOC and 206 COH domains along with several GHs families containing cellulases (GH5, GH44, GH48, and GH124) and amylases (GH13 with associated CBM26), respectively. We also identified 208 Bacteroidetes MGSs that contained a few DOC domains but, interestingly, none of these MGSs appeared to contain a recognizable COH domain. The presence of dockerin domains 210 not associated to cohesins in Bacteroidetes MGSs was recently reported in the moose rumen microbiome (Svartstrom et al. 2017). The role of the dockerins in Bacteroidetes is unclear but 212 the conspicuous absence of cohesins suggests that they may not be for the assembly of a bona fide cellulosome or that the Bacteroidetes cohesins are so distantly related from their 214 clostridial counterparts that they cannot be recognized.
Confirming previous reports in the literature (Kaoutari et al. 2013), the largest CAZyme 216 repertoires dedicated to plant degradation were found among the predicted Bacteroidetes, which represent the majority of the 324 reconstructed genomes. In Bacteroidetes, CAZymes 218 are often grouped in distinct Polysaccharide Utilization Loci (PULs) around susC and susD marker genes to build up specific depolymerization machineries capable of deconstructing in 220 a synergistic manner even the most complex polysaccharides (White et al. 2014;Ndeh et al. 2017). In this context, it is interesting to note the clustering of families GH137 to GH143 222 recently shown to catalyze the breakdown of type II rhamnogalacturonan (Ndeh et al. 2017 The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint the Bacteroidetes MGSs revealed the presence of degradation machineries dedicated to pectin (type II rhamnogalacturonan), starch, or barley β-glucan (Supplementary Figure 8). 226 Other MGSs with distinctive CAZymes were those assigned to Proteobacteria and Fibrobacteres that despite their small number (eight and six respectively) form tight groups. 228 Predicted Proteobacteria were characterized by the presence of families GH84 and GH103 along with an important diversity of GH13 subfamilies. In contrast, the Fibrobacteres show 230 the presence of several families known to degrade cellulose and β-glucans (e.g. GH5, GH45, and GH55). Focusing on CAZymes from Fibrobacter spp. present in the catalogue revealed 232 an astonishingly strain-level diversity for this genus. We compared the CAZymes present in  Table 7). Zooming on a particularly important 238 endoglucanase enzyme, i.e. GH45, reveals its presence in all 77 animals receiving different diets. Animals harbored between 4 to 13 GH45 variants and each gene was present in 25 to 240 up 99% of all animals; however, the type strain, at 79%, was not the variant most commonly present. 242

Common functions and Influence of diet on the bovine rumen microbiome
To investigate how different feeds affected the rumen microbiota in beef and dairy cattle we 244 examined beef Charolais bulls fed either a fattening diet high (n=16) or low (n=18) in starch and lipids and Holstein cows fed a corn silage and concentrate diet (n=23) or grazed a natural 246 prairie (n=20). By using this 77-sample dataset, differences in α-diversity were observed between diets at the gene level. Animals fed fresh grass had the highest α-diversity and 248 richness compared to other diets containing conserved feeds. Particularly, animals on . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint 13 fattening diets had a lesser α-diversity. In contrast, the fattening diet rich in starch and PUFA 250 had the highest β-diversity and/or had the highest disparity in interquartile range (box in the boxplot) for all indices (Figure 3). The rumen microbiome of animals fed this diet also 252 exhibited the highest dispersion on ordination analyses at the gene level (Supplementary   Table 8 and Supplementary Figure 12 a & b). For CAZy, 288 146 catabolic families exhibited indeed differences in abundance between the corn silage and grazing groups ( Supplementary Figures 12a and 13, Supplementary Table 9). Most of the 290 differences related to functions were due to increases in the relative abundance of genes in cows fed the corn silage diet rather than the presence of different genes. Notwithstanding, the 292 greatest contrast was observed for families targeting fructans and sucrose that were more abundant in the grazing group. Particularly for family GH32 (P = 7.6E-12) whose higher 294 abundance could be related to the high contents of sucrose and fructans in grasses (Ghasempour et al. 1998;Vijn and Smeekens 1999) included in the grazing diet. The other 296 CAZy families differing in abundance were all more abundant in the corn silage-diet group.
Interestingly, these results highlight the ability of ruminal bacteria to be equally capable of 298 using glycans from plants as well as from microbial origin such as bacterial peptidoglycans, bacterial exopolysaccharides and fungal cell walls. Corn silage, the main constituent of the 300 diet, is a fermented feed with an abundant epiphytic microbiota composed of . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint 16 exopolysaccharide-producing lactic acid bacteria, fungi and yeasts (Storm et al. 2010;Cheli et 302 al. 2013). Accordingly, the CAZome of the corn silage-diet group was oriented towards degradation of starch, a nutrient abundant in corn silage and practically absent in the grazing 304 diet. Forty-two CAZy families targeting plant cell wall polysaccharides were also overabundant in the corn silage-diet group. This could reflect the diversity of fiber structures 306 that ruminal bacteria have to face when cows are fed with such a diversified diet in terms of plant fractions and botanical origins (whole corn plant and soybean meal in the corn silage 308 diet against a natural prairie, composed predominantly of grasses in the grazing diet). Finally, the overabundance of CAZy families targeting animal glycans in the silage-fed cohort was 310 Figure 4. Size of the shared microbiome features among cattle (n= 77) fed four different diets for the number of genes (black), genera (orange), phyla (purple), MGSs (blue), KEGG pathways (red), and CAZy (green). The percentages of shared items and animals are represented on the y and x axes, respectively. The absolute numbers for each item are indicated at the intercept between the percentages of items and animals at the thresholds of 50, 90 and 100%. Only ~1% of genes were shared by 90% of the cattle, whereas close to 80% of KO and CAZy functions were shared by 90% of the cattle, suggesting gene redundancy for similar functions. To note the presence of most MGSs assembled in this work in 90% of the cattle.
17 striking since no glycoprotein-rich mucus is secreted in the bovine rumen as opposed to the lower gastrointestinal tract (Hoorens et al. 2011). It is possible that this difference reflects 312 that CAZy families targeting animal glycans harbor numerous enzymes that are not fully characterized and may be able to act on plant or even fungal glycans, which contain a panel of 314 osidic constituents that are very similar to that of animal glycans. Enzyme promiscuity may indeed confer metabolic flexibility and an ecological advantage to certain microbes in the gut 316 ecosystem.
For genera and MGSs, up to 44% (106 genera) and 58% (188 MGSs) of the total detected 318 were differently abundant in the microbial communities of the two cows' groups (Supplementary Table 10). Fibrobacter and Ruminoccocus were more abundant in the corn 320 silage diet group whereas Prevotella, Butyrivibrio and Methanobrevibacter were more abundant in the grazing group. 322 For Charolais on fattening diets differing in starch content, less than 5% differences were observed in the abundance of genes for functions or genera. Only eight CAZy families 324 exhibited differences in abundance between the two Charolais groups, the differences in abundances being less significant than for the Holstein groups (p > 0.004) (Supplementary 326 Figures 12a and 13, Supplementary Table 9). The absence of marked variations in the abundance of glycoside-degrading enzymes between the two fattening diets reflects indeed 328 their similar composition. The differences in starch content was not great enough to drastically impact the carbohydrate harvesting functions of the ruminal microbiota, at least at 330 the gene level. Similarly, smaller differences in the abundance of genera and MGSs were detected between these two diets (Supplementary Table 8, Supplementary Figure 12d). 332 Metadata collected on the Holstein and Charolais animals were analyzed using a vector fitting method on the top of the bidimensional NMDS ordination (Supplementary Table 11). 334 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint Diet had a significant effect on metagenome gene distribution, particularly in the Holstein group (r 2 = 0.68, P = 0.0001), but also variables such as live weight, feed intake and rumen 336 volatile fatty acids were significant. Protozoal numbers were also a significant variable explaining the distribution of genes in the metagenome of animals, underpinning their 338 importance as key members of the rumen ecosystem and modulators of the prokaryotic community. 340

Antibiotic resistance genes
We evaluated the presence of antibiotic resistance genes (ARGs) in the rumen gene catalogue 342 as previously reported (Xiao et al. 2016). Forty-two ARGs encoding resistance to 27 antibiotics were detected in the catalog. The most abundant resistances were to tetracycline 344 and bacitracin with Charolais animals harboring globally a higher proportion of these genes (Supplementary Figure 14). It is noted that antibiotics as growth promoters were never used 346 on these animals. In both the bovine rumen and the porcine gut (Xiao et al. 2016), the most abundant ARGs confer resistance to tetracycline and bacitracin. The diversity of ARG is low 348 compared to pig feces where resistance to up to 52 antibiotics was reported, even in farms with no use of growth promoting antibiotics (Xiao et al. 2016). 350

DISCUSSION
Ruminants are extraordinary bioreactors, engineered by nature to use recalcitrant plant 352 biomass-a renewable resource-as feedstock for growth and for production of useful products. This ability is a microbial attribute that was important in domestication and that 354 today has a renewed interest due to human population increases, resource scarcity and climate change issues. The reference gene catalog from the rumen microbiota reported here is a 356 useful resource for future metagenomics studies to decipher the functions and interactions of this complex ecosystem with feeds and the host animal. Comparison with human, mouse and 358 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint 19 pig gut catalogs shows the distinct character and potential of the rumen ecosystem. As opposed to the microbiome of single-stomached animals including humans, the rumen 360 microbiome harbors a plethora of genes coding for glycoside hydrolases (CAZymes) that degrade structural polysaccharides. Information on these enzymes that deconstruct biomass 362 plant material and are essential for transforming recalcitrant feeds into meat and milk is also useful for the design of improved processes for the biofuel industry (Weimer et al. 2009;Liao 364 et al. 2016).
The type of diet modulated as expected the abundance of genes and the metagenome profile 366 of individuals. However, more than 90% of genes coding for functions (KO and CAZy) were shared among animals receiving different diets. This large functional diversity might be the 368 key that allows ruminants to feed on a variety of dietary sources and to adapt to seasonal or production-imposed dietary changes. The 13.8M genes catalog produced in this work, despite 370 being significantly larger than gut bacterial catalogs from other species (Li et al. 2014;Xiao et al. 2015;Xiao et al. 2016)  Agriculture and all other applicable National and European guidelines and regulations.

Rumen Sampling 384
Total rumen content samples from 10 animals used for deep sequencing metagenome were taken at the experimental slaughterhouse of the INRA Centre Auvergne-Rhône-Alpes. Total 386 rumen content samples from 77 animals were also collected. These 77 animals, from two different genetic stocks, were fed diets characteristics of beef and milk production systems. 388 Beef cattle, represented by Charolais breed were fed fattening diets high (n=16) and low (n=18) in starch and lipids; whereas Holstein dairy cows were fed a corn silage and 390 concentrate diet (n=23) or grazed a natural prairie (n=20) (Supplementary Table 12). Rumen samples from these animals were also collected at the experimental slaughterhouse except for 392 the grazing group. Cows from this latter group were fitted with rumen cannula and samples were taken from live animals. 394

Sample handling and DNA extraction
The 10 rumen samples used for deep sequencing were depleted from eukaryotes using 396 washing and centrifugation steps. Rumen contents were filtered through a 400 µm nylon monofilament mesh. The filtrate was centrifuged at 300 g, 5 min to decant protozoa and the 398 supernatant (fraction A) was stored at 4 °C. Fifty grams from the filtered rumen content retentate were mixed with 100 ml of anaerobic phosphate saline buffer (PBS), mixed 400 manually for 5 min by gentle inversion, centrifuged at 300 g, 5 min to decant protozoa and the supernatant, passed through a 100-µm filter (fraction B), was stored at 4 °C. The pellet was 402 mixed with 75 ml anaerobic, ice-chilled 0.15% Tween 80 in PBS and incubated on ice for 2.5 h to detach microbes attached to feed particles. At the end of the incubation, contents were 404 vortexed for 15 s and centrifuged at 500 g, 15 min. The supernatant (fraction C) was mixed with fraction B and 50 ml of fraction A and centrifuged at 20,000 g, 20 min, 4 °C. The 406 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint 21 supernatant was decanted and the microbial pellet was exposed to an osmotic shock to lyse any remaining eukaryote (mainly protozoal) cells followed by an endonuclease treatment. 408 Briefly, the pellet was suspended in water (Millipore Waters Milli Q purification unit) and incubated for 1 h at room temperature followed by DNase treatment (Benzonase, Novagen) as 410 described (Hunter et al. 2011). The suspension was filtered through a 10 µm monofilament textile, collected by centrifugation as before, suspended in an appropriate volume of PBS and 412 stored at -80 °C until DNA extraction. DNA was extracted following the method described by Yu and Morrison (2004). Samples from 77 animals were extracted directly from whole 414 rumen contents using the same extraction method.

DNA library construction and sequencing 416
Paired-end (PE) metagenomic libraries were constructed and sequenced following Illumina HiSeq2000's instruction. Quality control and bovine DNA removal (by aligning reads to Bos 418 taurus genome Btau_4.0 (Elsik et al. 2016)) for each sample were independently processed using MOCAT pipeline as previously described (Kultima et al. 2012). On average, 111.3 Gb 420 of high-quality bases was generated for each of the 10 deep sequencing samples and 3.43 Gb (median ~2.5 Gb) for each of the 77 samples (Supplementary Table 12). The averaged 422 proportion of high-quality bases among all raw bases from each sample was 92.29%.

Public data use 424
Three public rumen microbial datasets used in this study include: (i) a cow rumen microbiome  Table 1). 444

Evaluation of current rumen microbial gene catalog
To assess the representative of our rumen gene catalog, we used the largest rumen gene 446 catalog published to date by Hess, et al.(Hess et al. 2011). First, the genes with gaps were filtered as follows: genes were broken when meet 'N' base, a subset for each interrupted gene 448 was obtained, retaining only the longest sub-gene as representative of the original gene. A total of 2.46 M genes without gaps were obtained, termed 'JGI-2011-gene-catalog' and used 450 for following analysis (Supplementary Table 2).
Further, 13.83 M genes from current study and 2.46 M genes from JGI were pooled 452 together to identify shared genes using CD-HIT (Li and Godzik 2006). The comparison of gene length between the two catalogs was conducted as previously described (Li et al. 2014). 454 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint

Gene catalog annotation
Taxonomic assignments of genes from rumen, mouse, pig and human guts were performed 456 using CARMA3(Gerlach and Stoye 2011) on the basis of BLASTP (Altschul et al. 1990) ( V2.2.24) against the NCBI-NR database (v20130906 for rumen, mouse, pig guts; v20160219 458 for human gut) (Supplementary Table 3). Microbiotas from these four species were compared at different taxonomic levels. Functional annotation based on Kyoto Encyclopedia of Genes 460 and Genomes (KEGG) database was performed using an in-house pipeline (Li et al. 2014).
Annotation of the carbohydrate-active enzymes (CAZymes) of each catalog was performed by 462 comparing the predicted protein sequences to those in the CAZy database and to Hidden Markov models (HMMs) built from each CAZy family (Lombard et al. 2014), following a 464 procedure previously described for other metagenomics analyses (Svartstrom et al. 2017). In order to allow a direct comparison of the results, annotation of antibiotic resistance genes 466 (ARGs) was done as previously reported in the pig metagenome catalogue (Xiao et al. 2016) by using the ARDB database (Liu and Pop 2009). 468

Construction relative abundance profiles of genes, KOs, ARG and CAZY enzymes
The gene profiles of 77 rumen samples were generated by aligning high-quality clean reads to 470 the current 13.83M gene catalogue (SOAP2, ≥95% identity) . Gene relative abundance was estimated as described previously (Qin et al. 2010). The relative abundance 472 of each KEGG orthologous group (KO), ARGs and CAZy enzyme was calculated from the abundance of its genes (Li et al. 2014). 474

Characterization of total and minimal metagenome
We computed the total and shared number of genes, KO and CAZy functions present in 476 random combinations of n individuals (with n=2 to 77, 100 replicates per bin) (Qin et al. 2010). Furthermore, we used a permutation test to identify the second-level KEGG functions 478 that were significantly enriched or depleted in the minimal KO set compared with the total

Construction of metagenomic species (MGS) and taxonomic assignment 490
To recover the draft bacterial and archaeal genomes from the 10 deep sequenced samples, we developed an in-house pipeline that comprises three steps as indicated in Supplementary 492 Figure 18 and described below.

Construction of Scaftig-Linkage Groups (SLGs) 494
We generated a scaftig abundance profile by aligning high-quality clean reads from 77 rumen samples to assembled scaftigs from samples . Scaftig relative 496 abundance was determined using the same method for gene abundance ). The highly co-abundance correlated scaftigs from each deep sequencing sample were binned into 498 scaftig-linkage groups (SLGs) using the previous pipeline  with modified parameters as follows, an edge was assigned between two scaftigs sharing Pearson correlation 500 coefficient > 0.7 and the minimum edge density between a join was set as 0.99. A total of 745 preliminary SLGs with length > 1Mbp were generated for further analysis. 502 . CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint 25

Filtering of Preliminary SLGs based on GC content and assembly outputs
For all preliminary SLGs, we then examined their specificity by plotting the GC content 504 versus reads aligned depth of each scaftig. In this step, 520 SLGs containing sole GC cluster were treated as 'qualified' and retained for the step 3. For the remaining 225 SLGs, 184 506 presented a scattered GC distribution and were discarded whereas the 41 SLGs containing two or more GC clusters were further processed. First, those SLGs with scaftig N50 <2000bp 508 were considered as too fragmented and discarded. Then, multiple GC clusters in remaining SLGs were separated by DBSCAN(M. Ester et al. 1996) (Eps<=0.10, MinPts>=49). After 510 splitting and filtering, we retained 55 'qualified' SLGs that had a coverage depth greater than 20×. 512

Reconstruction of metagenomic species
In order to improve the completeness and remove the redundancy of multiple metagenome 514 assemblies from 10 deep sequencing samples, we performed hierarchical clustering for these 575 qualified SLGs based on their scaftigs nucleotide identity calculated by MUMi (Deloger 516 et al. 2009). The MUMi distance between two SLGs (a and b) was defined as: is the length of SLG a, and ℎ is the length 520 of unmapped sequence compared with SLG b. The threshold for generating a species level metagenomic species (MGS) at 0.54 for MUMi as previously suggested (Backhed et al. 522 2015). Two hundred and eighteen qualified SLGs could not be clustered with other SLGs and were defined as singleton-MGSs. The remaining 357 qualified SLGs were clustered into 105 524 26 candidate-MGSs. We performed overlap-based assembling on the scaftigs for each of these 105 candidate-MGSs respectively, using Phrap with default parameters. To get reliable 526 contiguous sequences for each candidate-MGS, the overlaps between two scaftigs less than 500bp were considered as unreliable and re-broken. 528 The 105 reconstructed candidate-MGSs were further examined using GC patterns using the same method mentioned in step 2 above. Eighty out of the 105 candidate-MGSs containing 530 sole GC cluster were retained as combined-MGS. The remaining 25 candidate-MGSs containing two or more clusters were split into sub-MGSs using the same method mentioned 532 in step 2 above. In order to preserve the most comprehensive genomic information for these sub-MGSs, sequences from each sub-MGS was aligned back to its original SLGs. If the sub-534 MGS covered 90% or more sequences of its original SLG, it would be retained as a revised-MGS. Otherwise, its original SLG will replace the corresponding sub-MCs and be considered 536 as a revised-MGS. This splitting step finally obtained 31 revised-MGSs.
After filtering the total sequence size of 218 singleton-MGSs, 80 combined-MGSs and 31 538 revised-MGSs with the criterion of > 1Mbp, we finally obtained 324 MGSs for rumen microbiota including 224 singleton-MGSs and 100 combined-MGSs (Supplementary Table  540 14). We used the same pipeline described above for the gene catalog for the ORF prediction and taxonomic annotation of MGS genes. We used CheckM (Parks et al. 2015) to estimate of 542 the completeness, contamination and heterogeneity of metagenomic species (Supplementary   Table 14). MGSs were assigned a taxonomic level annotation if more than 50% of its genes 544 were assigned at a given taxonomic level (including genes with no match) (Supplementary Table 15). The MGS relative abundance of 77 rumen samples was calculated from the 546 relative abundance of its aligned genes.
. CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint

Quality assessment and taxonomic annotation of MGSs 548
CheckM software (Parks et al. 2015) was used to calculate the completeness and contamination of these MGSs. The median percentage of completeness was high at 62.5% 550 with a low, 2.6% contamination. The combined-MGSs showed higher completeness but also slightly higher levels of contamination and strain heterogeneity than singleton-MGSs 552 ( Supplementary Figures 15, Supplementary Figures 16). Taxonomic annotation for rumen MGSs was performed using CARMA3 on the basis of BLASTP against the NCBI-NR 554 database (v20130906) and compared with MGSs from pig and mice (Supplementary Table   14, Supplementary Figure 17). 556

Cluster distribution by diet at species level
The relative MGS abundance profile (matrix of 324 × 77) obtained above was analyzed to 558 highlight differences induced by diet. As we found when coverage of a MGS is less than 0.1 the depth of this MGS is close to 0 (Supplementary Figure 19). This result is caused by the 560 noise and non-conducive to the MGS clustering. Therefore, when the coverage value was less than this threshold value, then we set the value of depth equal to 0. 562

Ordination and differentially abundance analyses
Breed and diet distribution were visualized in ordination analyses based on two-dimensional 564 non-metric multidimensional scaling (Shepard 1962). Dissimilarity between pairs of samples was calculated using Bray-Curtis dissimilarity index (Bray and Curtis 1957). Vegan R 566 package (Oksanen et al. 2016) was also employed to estimate the diversity indexes corresponding to richness, alpha (Shannon index) and beta diversity (Whittaker). The 'envfit' 568 function of Vegan was used to determine whether phenotype information corresponding to the 77 samples contribute to the overall pattern of the rumen microbiome structure. The 570 significance of the environmental factors was assessed after 9999 permutations.
. CC-BY-NC-ND 4.0 International license author/funder. It is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint

28
The relative abundance of the 13,825,880 non-redundant genes was collapsed into 572 taxonomic (Phylum and Genus) and functional levels (KEEG and CAZy). Procrustes rotation analysis was performed to compare the ordinations obtained at different levels. Identified 574 KOs were mapped to KEGG and visualized using the Interactive Pathway Explorer (iPath2.0) web-based tool (Yamada et al. 2011). To estimate a core, the overlapping number of Genus, 576 CAZy and KOs between Holstein and Charolais breeds was compared.
To avoid confounding factors such as: sex, breed and age, the differentially abundance 578 analysis was performed within breeds. Therefore, for each breed, diet comparison was done based on a Zero-Inflated Gaussian mixture model as implemented in the fitZig function of the 580 metagenomeSeq R package (Paulson et al. 2013). Correction for multiple testing was done, and the cut-off of the differential abundance was set at FDR ≤ 0.05. 582

DATA ACCESS
Metagenomic sequencing data generated in this study have been deposited in EBI database 584 under the accession code PRJEB23561. The data of assembled scaftigs, the rumen gene catalog, the rumen MGS catalog, and the abundance profile tables generated in this study have been 586 deposited in GigaScience Database (doi: to be completed). The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/272690 doi: bioRxiv preprint