Ecological relevance of flagellar motility in soil bacterial communities

Abstract Flagellar motility is a key bacterial trait as it allows bacteria to navigate their immediate surroundings. Not all bacteria are capable of flagellar motility, and the distribution of this trait, its ecological associations, and the life history strategies of flagellated taxa remain poorly characterized. We developed and validated a genome-based approach to infer the potential for flagellar motility across 12 bacterial phyla (26 192 unique genomes). The capacity for flagellar motility was associated with a higher prevalence of genes for carbohydrate metabolism and higher maximum potential growth rates, suggesting that flagellar motility is more prevalent in environments with higher carbon availability. To test this hypothesis, we applied a method to infer the prevalence of flagellar motility in whole bacterial communities from metagenomic data and quantified the prevalence of flagellar motility across four independent field studies that each captured putative gradients in soil carbon availability (148 metagenomes). We observed a positive relationship between the prevalence of bacterial flagellar motility and soil carbon availability in all datasets. Since soil carbon availability is often correlated with other factors that could influence the prevalence of flagellar motility, we validated these observations using metagenomic data from a soil incubation experiment where carbon availability was directly manipulated with glucose amendments. This confirmed that the prevalence of bacterial flagellar motility is consistently associated with soil carbon availability over other potential confounding factors. This work highlights the value of combining predictive genomic and metagenomic approaches to expand our understanding of microbial phenotypic traits and reveal their general environmental associations.


Introduction
Microorganisms navigate their environment by responding to gradients in nutrients, toxins, and environmental conditions, in a process called chemotaxis [1,2]. Flagellar motility is a widespread adaptation that allows bacteria to colonize new micro-environments by facilitating access to space and nutrients [3,4], and enables escape from unfavorable conditions [5] and predators [6]. For example, moving toward environmental cues is an effective mechanism by which most pathogens [7,8] and symbionts [9,10] colonize their hosts. Despite the recognition that swimming and swarming (the two main modes of flagellar motility) are widely used to navigate microbial environments, empirical knowledge on the environmental conditions where bacterial flagellar motility can be beneficial remains rather limited, as most knowledge derives from laboratory-based studies using model organisms.
In laboratory conditions, bacteria have been widely investigated for their ability to swim toward resources [11,12], display quorum sensing [13], or swim away from toxins [14]. Several experimental studies show that the hydration level of surfaces generally predicts how easily bacteria can colonize a given surface [15], and that flagellar motility also predicts the temporal persistence of bacterial pathogens in host microbiomes [16]. The high energetic cost of powering the flagellar machinery is tightly linked to regulatory systems that control flagellar expression depending on the spatial proximity and quality of available resources (i.e. optimal foraging based on energetic constraints; [17-19]). Also, different flagellar systems have evolved in response to distinct environmental conditions, as exemplified by the case of the Vibrio genus, which uses different flagellar systems depending on the spatial complexity of their surroundings [20]. This broad body of knowledge leads to the expectation that flagellar motility should display general ecological associations, but such patterns have not been comprehensively explored.
Research on bacterial flagellar motility predates modern microbiology, and laboratory-based approaches have enabled the discovery of the genes involved in flagellar assembly [21,22]. The genes encoding for the flagellar machinery are reasonably well known and generally conserved across a broad diversity of bacterial groups [23,24]. Because the production of flagella requires a well-defined gene repertoire, the prediction of flagellar motility across taxa is likely feasible [25]. Yet, the proportion of bacterial taxa for which flagellar motility could theoretically be inferred from genomic information contrasts with the relatively limited number of strains for which flagellar expression has been empirically determined. A comparison of the number of strains with known motility information in bacterial phenotypic trait databases [26] to the total number of genomes contained in the Genome Taxonomy Database (GTDB r207, [27]) highlights that we have information on whether taxa are flagellated or not for only ∼10% of the bacterial strains with available whole genome information. If we could infer the capacity for flagellar motility across a broad diversity of microbial taxa, we could determine the set of traits that generally characterize flagellated taxa (so-called life history strategies; [28,29]). Previous studies have linked flagellar motility to a fast growth (copiotroph) strategy [30], which may be an adaptation to access resource patches [31], and flagellar motility is expected to be associated with a life history strategy for rapid nutrient acquisition [32]. However, one of the main challenges with identifying the life history strategies of bacteria remains the quantification of phenotypic traits. Thus, developing methods to infer flagellar motility across single bacterial genomes and metagenomes can help us identify the main ecological and life history associations of this important trait.
Flagellar motility is likely common for bacteria living in many environments, including host-associated and aquatic environments [2]. However, we are particularly interested in the prevalence of flagellar motility in soil environments because soil is a heterogeneous environment where resources are patchily distributed, and access to resources is a key factor structuring soil bacterial communities [33,34]. We expect a high degree of variability in the prevalence of motile bacteria in soil as motility requires continuous water films [35,36], and the high energetic cost associated with flagellar motility may be disadvantageous in the resource-limited conditions often common in soil [17,37]. Since organic carbon (C) compounds are likely the main sources of energy for soil bacteria, soil C availability is likely a key factor determining the selective advantage of flagellar motility in soil. Indeed, several studies have found a higher prevalence of bacterial flagellar motility in soil environments that generally have higher C availability. For example, plant rhizospheres usually contain elevated levels of available C compared to adjacent bulk soil environments due to plant-derived organic carbon inputs [38] and generally harbor a higher prevalence of flagellar genes [39,40]. In arid environments, several studies have detected a negative relationship between the prevalence of flagellar motility genes and aridity [41,42], which could be due to either lower C availability or lower moisture. Given the spatial heterogeneity of soil, and the fitness advantage theoretically gained from flagellar motility in conditions where energy-rich resources are patchily distributed [18], we hypothesize that bacterial flagellar motility should exhibit a general positive relationship with soil C availability.
We had three objectives for this study. First, we wanted to build genome and metagenome-based models to accurately infer the potential for flagellar motility across bacterial taxa and whole bacterial communities. Second, we sought to identify the general life history strategies associated with flagellar motility in bacteria. Third, we aimed to determine the prevalence of flagellar motility in soil bacterial communities and to test the hypothesis that flagellar motility is more prevalent in soils with higher C availability. To this end, we estimated the potential for flagellar motility across 26 192 bacterial taxa with available genomic information based on a machine learning model trained on empirical information for this trait, and explored whether flagellar motility is associated with broader life history strategies. We then applied a method to estimate flagellar motility as a community-aggregated trait directly from metagenomes. We used this method to investigate the prevalence of flagellar motility across four independent sample sets that we would expect to capture gradients in soil C availability and confirmed our findings using metagenomes from a soil incubation experiment where C availability was experimentally manipulated via glucose amendments.

Genome selection and annotation
We compiled genomic data from ∼62 000 unique bacterial taxa ("species clusters") available in the Genome Taxonomy Database (GTDB) (release 207; [27]), following the taxonomic convention of this release throughout the manuscript. We restricted our analyses to bacterial phyla with more than 100 representative genomes available in GTDB and only included genomes estimated to be >95% complete based on CheckM (v1.1.6) [43]. We also removed all genomes that lacked a 16S rRNA gene, as well as those with signals of chimerism based on GUNC (Genome Unclutterer; [44]), yielding 26 192 genomes in total belonging to 12 different phyla.
The coding sequences of the 26 192 genomes were identified using Prodigal (v2.6.3; [45]). We then aligned the predicted coding sequences for each genome to the Pfam database (v35.0; [46]) using HMMER (v3; [47]) to obtain information on all potential domains and genes present in those genomes. All matches with a bit score lower than 10 were discarded. We then binarized all copy numbers of genes and domains in each genome to presence/absence for further analyses. We selected a set of 21 genes out of a total of 35 genes involved in flagellar assembly in Pfam based on their prevalence among strains with empirical information on flagellar motility (Supplementary Data 2). Specifically, this subset of genes was chosen based on the following criteria: (i) genes were present in >80% of taxa with experimentally demonstrated flagellar motility, and (ii) genes were not present in >50% of taxa classified as nonmotile based on available phenotypic information (see below). This step was necessary as many nonmotile taxa conserve genes for flagellar motility (Supplementary Data 3), and some of the flagellar genes are not well represented in Pfam. We used the information on the presence/absence of these 21 genes across genomes to build a predictive model of flagellar motility in bacteria (Supplementary Data 2).
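The two selection criteria above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the input layout (a dictionary of per-genome copy numbers for each gene), and the toy gene labels are assumptions made for illustration only.

```python
def binarize(copy_numbers):
    """Convert per-genome gene copy numbers to presence (1) / absence (0)."""
    return [1 if n > 0 else 0 for n in copy_numbers]


def select_marker_genes(copy_number_table, is_motile,
                        min_motile=0.80, max_nonmotile=0.50):
    """Keep genes present in more than `min_motile` of motile genomes and
    in no more than `max_nonmotile` of nonmotile genomes.

    copy_number_table: dict mapping gene name -> list of copy numbers,
                       one entry per genome (hypothetical layout).
    is_motile:         list of booleans, one per genome, in the same order.
    """
    n_motile = sum(is_motile)
    n_nonmotile = len(is_motile) - n_motile
    selected = []
    for gene, copies in copy_number_table.items():
        present = binarize(copies)
        frac_motile = sum(p for p, m in zip(present, is_motile) if m) / n_motile
        frac_nonmotile = sum(p for p, m in zip(present, is_motile) if not m) / n_nonmotile
        if frac_motile > min_motile and frac_nonmotile <= max_nonmotile:
            selected.append(gene)
    return selected


# Toy example: five motile genomes followed by five nonmotile genomes.
motile_flags = [True] * 5 + [False] * 5
toy_table = {
    "gene_A": [1, 2, 1, 1, 1, 1, 0, 0, 0, 0],  # passes both criteria
    "gene_B": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # too common in nonmotile genomes
    "gene_C": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # too rare in motile genomes
}
chosen = select_marker_genes(toy_table, motile_flags)
```

The thresholds are exposed as parameters so that the sensitivity of the gene set to the 80% and 50% cutoffs could be explored.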

Genome-based prediction of flagellar motility in bacteria
We compiled all information on whether bacterial taxa displayed flagellar motility or not from the bacterial phenotypic trait database compiled in [26]. This database contains information on motility traits for 13 481 unique bacterial strains [26]. We first selected only the subsets categorized as having flagellar motility or as being nonmotile (8191 unique strains). To obtain representative genomes for these strains, we matched the National Center for Biotechnology Information (NCBI) taxon id of each of these strains to their corresponding genome accession in GTDB. To ensure maximal reliability of the genomic information used for model training, we only kept those genomes that were 100% complete, and applied the same quality filters mentioned above. This led to a final subset of 1225 high-quality genomes (388 categorized as having flagellar motility, 837 nonmotile) that we used for model training (Supplementary Data 1). We note that these 1225 genomes included taxa from 18 unique phyla, with the proportions of motile taxa per phylum ranging from 0% to 100% (Supplementary Fig. 2).
Since some of the genes involved in flagellar assembly are often present in several nonmotile taxa ([48]; Supplementary Data 3), we were not able to use standard statistical approaches to build a predictive model of flagellar motility based on the presence/absence of 21 flagellar genes. We thus used gradient boosted regression decision trees that could accommodate the complexity of having 21 predictive features using the "xgboost" package in R (v1.7.5; [49]). To this end, we first built a training and a test set (70:30, randomly selected) of the matrix containing the presence/absence of the flagellar genes for each of the representative bacterial genomes with experimental information on flagellar motility using the xgb.DMatrix function of "xgboost." We then applied Bayesian hyperparameter optimization to select the best parameters for the regressor model using the bayesOpt function of the "ParBayesianOptimization" R package (v1.2.6; [50]), specifying the objective function as a binary logistic regression. We ran k-fold cross-validation using the xgb.cv function in "xgboost" to identify the optimal number of iterations of model improvement for the final model training function. We built the boosted regression model using the optimized parameters and iterations calculated above using the xgboost function of package "xgboost." We used the function xgb.importance from "xgboost" to compute the predictive importance of the different genes in the final model, which identified 14 flagellar genes that were most useful for predicting flagellar motility (Supplementary Data 4), even though the full set of 21 genes was needed for accurate prediction (see Results and Discussion for details on the performance of the final selected model). We evaluated model performance using the accuracy index.
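As a rough illustration of the data split and the evaluation step (not of the boosted model itself, which was fitted in R), the following Python sketch shows a random 70:30 split and a summary of binary predictions. The function names are hypothetical, and treating the "accuracy index" as the proportion of correct predictions is an assumption; sensitivity and specificity are included because per-class performance is reported in the Results.

```python
import random


def split_train_test(items, train_fraction=0.70, seed=0):
    """Randomly split items into training and test subsets (70:30 by default)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def classification_summary(observed, predicted):
    """Accuracy, sensitivity, and specificity for binary labels
    (1 = flagellated, 0 = nonflagellated).
    Assumes both classes occur in `observed`."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # flagellated taxa correctly predicted
        "specificity": tn / (tn + fp),   # nonflagellated taxa correctly predicted
    }


# Toy example with made-up labels.
train_ids, test_ids = split_train_test(range(10))
summary = classification_summary([1, 1, 0, 0, 0], [1, 1, 1, 0, 0])
```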

Phylogenetic analysis
To investigate the phylogenetic distribution of flagellar motility in bacteria, we first randomly selected a single genome from each family within the 12 predominant phyla investigated (485 genomes in total). Since we had already predicted the potential for flagellar motility across GTDB genomes, we subsetted the tree provided by GTDB with the selected genomes, which can be found in https://data.gtdb.ecogenomic.org/releases/release207/207.0/bac120_r207.tree. This tree is based on the alignment of 120 single-copy marker genes and is therefore more robust than a conventional maximum likelihood tree based on the alignment of full 16S rRNA gene fragments. We visualized and edited the trees using iTOL (v5; [51]). We tested whether flagellar motility had a phylogenetic signal by calculating the phylogenetic D index for binary traits [52], where values (positive or negative) closer to 0 indicate phylogenetic conservatism, and values closer to 1 indicate a random phylogenetic pattern. This phylogenetic analysis was conducted using the R package "ape" (v5.7-1; [53]).
We additionally explored the degree of conservatism of flagellar motility across different levels of phylogenetic resolution by measuring the standard deviation (SD) of the flagellar motility status (flagellated, 1; nonflagellated, 0) across taxa from different taxonomic ranks (phyla, classes, orders, families, and genera). For this analysis, we only included those taxa that were represented by more than one genome.
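The SD-based measure of conservatism can be sketched in Python as follows. This is an illustrative sketch only: the function name and the input layout are hypothetical, and the use of the sample SD (rather than the population SD) is an assumption, since the text does not specify which was used.

```python
from collections import defaultdict
from statistics import stdev


def motility_sd_by_group(records):
    """Standard deviation of binary motility status within each taxonomic group.

    records: iterable of (group_name, status) pairs, where status is
             1 (flagellated) or 0 (nonflagellated).
    Groups represented by a single genome are skipped, as in the text.
    """
    groups = defaultdict(list)
    for group, status in records:
        groups[group].append(status)
    return {g: stdev(v) for g, v in groups.items() if len(v) > 1}


# Toy example with hypothetical genus labels.
toy_records = [
    ("genus_1", 1), ("genus_1", 1), ("genus_1", 1),  # fully conserved
    ("genus_2", 1), ("genus_2", 0),                  # mixed
    ("genus_3", 1),                                  # single genome, excluded
]
sd_per_genus = motility_sd_by_group(toy_records)
```

An SD of 0 means that all genomes in the group share the same predicted status; larger values indicate groups with both flagellated and nonflagellated members.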

Analysis of bacterial life history strategies associated with flagellar motility
We investigated associations between flagellar motility and broad functional gene categories by testing the prevalence of Clusters of Orthologous Genes (COGs) in the genomes of taxa predicted to be flagellated or nonflagellated [54]. We excluded the phyla Bacteroidota, Chloroflexota, Cyanobacteria, Spirochaetota, and Mycoplasmatota from this analysis as these phyla had either too-high (>90%) or too-low (<15%) proportions of flagellated taxa to perform robust statistical comparisons. We annotated genomes (n = 21 551) into COG categories using eggNOG-mapper v2 [55] and calculated the genome size-corrected prevalence of each COG category per genome. We also investigated general genomic features such as genome size and the 16S rRNA gene copy number for each of the genomes to compare these genomic attributes between motile and nonmotile taxa within each phylum. We included 16S rRNA gene copy number as it is considered a proxy for maximal potential growth rates in bacteria [56]. We compiled and identified the genes involved in chemotaxis (Supplementary Data 5) across the genomes of flagellated and nonflagellated taxa as a validation given the chemotaxis signaling pathway is an activator of the flagellar motor system [4].

Estimation of the prevalence of flagellar motility in microbial communities using metagenomic information
We applied a method to estimate the prevalence of flagellar motility as a community-aggregated trait using metagenomic information from bacterial communities. To this end, we first assembled "mock" metagenomes containing different proportions of genomes from flagellated and nonflagellated taxa from the subset we originally used for boosted regression model training. We selected 20 genomes of taxa with empirically verified flagellar motility capabilities out of a pool of 40 unique genomes spanning the phyla Actinobacteriota, Firmicutes, and Proteobacteria as these are ubiquitous taxa and are well represented in our training data (Supplementary Data 4). We used ART (a next-generation sequencing simulator; [57]) to simulate short (150-bp) shotgun sequencing reads at a coverage of 50% of these genome mixtures. We did not choose higher coverage as soil metagenomic datasets do not usually exceed 50% community coverage [58]. We then constructed a DIAMOND (v2.0.7; [59]) database containing the protein variants for each of the 21 selected genes (7-633 variants per gene) identified from the genomic analyses (see above) that were determined to be robust predictors of flagellar motility, as well as the variants contained in GTDB for the 120 single-copy marker genes that constitute the taxonomic basis of GTDB [27]. We ran trimmomatic (v0.39; [60]) to remove adapters and low-quality base pairs using a phred score of 33 as a threshold, only keeping reads above 100 bp after trimming. We annotated the simulated metagenomes using BLASTx (v2.13.0; [61]) on this custom protein database (see code specifics in https://doi.org/10.6084/m9.figshare.25104116.v2). We then filtered out reads with <50 bit score, <60% identity to the reference protein, and an e-value higher than 0.001 based on the outputs. In this way, we obtained a reads-per-kilobase (RPK) index for both the flagellar gene and the single-copy marker gene sets by taking the median gene length-corrected number of hits of each protein across the 21 and 120 unique proteins, respectively. We finally built a "flagellar motility index" based on the ratio between the flagellar gene RPK and the single-copy marker gene RPK. The use of single-copy marker genes in this manner offers a general normalization of the flagellar gene read count (which can vary due to differences in library size, coverage, or diversity), as these single-copy genes are assumed to be present in every bacterial genome [62]. We then determined the linear relationship between the flagellar motility index and the proportions of genomes that were able to produce flagella across mock metagenomes, following a similar approach to [63]. We used this standard curve to estimate the proportion of genomes in a given metagenome that are able to produce flagella based on the flagellar motility index, as expressed in equation (1):
% of bacteria with flagellar motility = 3650 × Flagellar motility index − 0.321    (1)
where the flagellar motility index is the ratio between the median RPK of the 21 flagellar motility genes over the median RPK of the 120 single-copy marker genes. This method allows the estimation of the prevalence of flagellar motility in any given bacterial metagenome based on the assumption that flagellar genes are usually found in single copies among bacterial genomes [23].
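The calculation of the flagellar motility index and the application of equation (1) can be sketched in Python as below. The slope (3650) and intercept (−0.321) come from equation (1); everything else (function names, the dictionary inputs, and expressing gene length in base pairs before conversion to kilobases) is an assumption made for illustration and does not reproduce the deposited code.

```python
from statistics import median


def median_rpk(hit_counts, gene_lengths_bp):
    """Median reads-per-kilobase (RPK) across a gene set.

    hit_counts:      dict gene -> number of filtered read hits.
    gene_lengths_bp: dict gene -> reference length in base pairs (assumed unit).
    """
    return median(hit_counts[g] / (gene_lengths_bp[g] / 1000.0) for g in hit_counts)


def flagellar_motility_index(flagellar_rpk, marker_rpk):
    """Ratio of the flagellar gene RPK to the single-copy marker gene RPK."""
    return flagellar_rpk / marker_rpk


def percent_flagellated(index, slope=3650.0, intercept=-0.321):
    """Equation (1): estimated % of bacteria with flagellar motility."""
    return slope * index + intercept


# Toy example with made-up hit counts for three genes.
toy_hits = {"gene_a": 10, "gene_b": 20, "gene_c": 30}
toy_lengths = {"gene_a": 1000, "gene_b": 1000, "gene_c": 2000}
toy_rpk = median_rpk(toy_hits, toy_lengths)   # RPKs are 10, 20, 15
toy_percent = percent_flagellated(0.01)       # 3650 * 0.01 - 0.321
```

Because equation (1) is a linear standard curve, estimates are not bounded between 0 and 100; the sketch leaves them unbounded rather than assuming how out-of-range values were handled.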

Testing associations between bacterial flagellar motility and soil carbon availability
We selected metagenomic datasets that covered expected gradients in soil C availability, which we hypothesized to be positively associated with bacterial flagellar motility. Soil C availability is challenging to measure in situ and direct measurements of soil C availability (which is not equivalent to total C concentrations) are rarely compiled along with metagenomic data. We thus selected datasets that we expect based on published research to span gradients in C availability, recognizing that C availability is often correlated with other soil variables. The datasets included are the following: (i) soils from across the USA spanning gradients in soil depth (surface, 0-20 cm; subsurface, 20-90 cm; n = 66), where total organic C decreases with depth [64]; (ii) a net primary productivity (NPP) gradient across Australia (n = 38, [65]), where higher NPP is expected to be associated with higher soil organic C availability [66]; (iii) a global comparison of rhizosphere and bulk soils associated with citrus plants (n = 20, [67]), where we would expect C availability to be higher in rhizosphere soils than in bulk soils [68]; and (iv) a pot experiment with controlled water inputs comparing the rhizosphere and adjacent bulk soil of wheat plants (n = 24, [69]). Since all these datasets contain factors that likely covary with soil C availability, we additionally obtained metagenomic data from soils that were incubated with or without glucose amendments over a 117-d incubation period in a previous study [70]. In this experiment, glucose was added weekly to subsamples of a single soil at a rate of 260 μg C g dry wt soil⁻¹ day⁻¹ (see [70] for full details). Since this experiment was performed under constant moisture conditions and in the absence of plants [70], the glucose amendments should lead to an increase in C availability with minimal direct effects on other soil attributes. The addition of glucose in this experiment led to a 7.9-fold increase in the microbial CO2 respiration rates [70], confirming that the C available to soil microbes increased in the soils amended with glucose compared to the controls (i.e. soils that received only an equivalent amount of water).
We generated metagenomic data from the nine soil samples harvested from the glucose amendment experiment. For each soil sample (four with added glucose, five without glucose), we used 0.25 g of soil for deoxyribonucleic acid (DNA) extraction using the DNeasy PowerSoil Pro tube kit (Qiagen). The shotgun sequencing library was prepared using Illumina's DNA Prep kit and Unique Dual Indexes (Illumina, CA). Samples were quantified using Qubit and pooled at equimolar concentrations. The library was run on a NovaSeq 6000 (Illumina, CA) at the Texas A&M AgriLife Genomics & Bioinformatics Service (USA) using a 2 × 150 cycle flow cell. Sequence cluster identification, quality prefiltering, base calling and uncertainty assessment were done in real time using Illumina's NCS 1.0.2 and RFV 1.0.2 software (Illumina, CA) with default parameter settings. We also analyzed the 16S rRNA gene sequencing information on the same soil communities (see [70] for details on how these data were generated and processed).

Processing of shotgun metagenomic sequencing reads from datasets covering gradients in soil C availability
To process the metagenomic data from all of the datasets described above (157 metagenomes in total), we first downloaded the sequences from the Sequence Read Archive (SRA) of NCBI when applicable, and ran trimmomatic (v0.39; [60]) to remove adapters and low-quality base pairs using a phred score of 33 as a threshold, only keeping reads above 100 bp after trimming. We used BLASTx on the custom DIAMOND database we created to annotate the metagenomic reads. We filtered out reads that had <50 bit score, <60% identity to the reference protein, and an e-value higher than 0.001. We finally measured the flagellar motility index based on the ratio between the median RPK of the flagellar genes and the median RPK of the 120 single-copy marker genes as described above, and applied equation (1) to quantify the prevalence of flagellar motility in any given metagenome.
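The read-filtering thresholds described above can be illustrated with a short Python sketch. It assumes hits are stored in the standard 12-column BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score); whether the study used this exact layout is an assumption, and the function names and example identifiers are hypothetical.

```python
def passes_thresholds(fields, min_bitscore=50.0, min_identity=60.0, max_evalue=1e-3):
    """Return True if a tabular hit meets all three thresholds from the text."""
    identity = float(fields[2])    # column 3: % identity
    evalue = float(fields[10])     # column 11: e-value
    bitscore = float(fields[11])   # column 12: bit score
    return (bitscore >= min_bitscore
            and identity >= min_identity
            and evalue <= max_evalue)


def filter_hits(lines):
    """Parse tab-separated hit lines and keep those passing the thresholds."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 12 and passes_thresholds(fields):
            kept.append(fields)
    return kept


# Toy example with made-up hits (read and protein identifiers are placeholders).
toy_lines = [
    "read1\tprotA\t85.0\t48\t7\t0\t1\t144\t10\t57\t1e-20\t95.1",  # kept
    "read2\tprotA\t55.0\t48\t21\t0\t1\t144\t10\t57\t1e-8\t60.2",  # identity too low
    "read3\tprotB\t90.0\t30\t3\t0\t1\t90\t5\t34\t0.5\t30.0",      # weak hit
]
kept_hits = filter_hits(toy_lines)
```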

Statistical analysis
All statistical analyses were conducted in R (v4.1.3; [71]). We used principal component analysis (PCA) to visualize how well the presence/absence of the selected flagellar motility genes was able to discriminate between the genomes of flagellated and nonflagellated taxa. To identify potential differences in the life history strategies of flagellated and nonflagellated taxa, we used multiple Mann-Whitney U tests with Bonferroni correction for multiple comparisons to investigate whether particular COG categories were overrepresented in genomes from flagellated versus nonflagellated taxa. The results were presented as the log2-fold ratio. We used Mann-Whitney U tests to investigate associations between flagellar motility and the 16S rRNA gene copy number and the number of chemotaxis genes in any given genome due to non-normality of the data. We compared differences in genome size between flagellated and nonflagellated taxa using Welch two-sample t-tests.
To test for differences in the prevalence of flagellar motility between surface and subsurface soils, we used a mixed-effects linear model with location coded as a random factor, and for the test between rhizosphere and bulk soils in wheat, we used Welch two-sample t-tests. Since we only had a single rhizosphere and bulk soil observation per site in the global citrus rhizosphere dataset, we first calculated the difference in the prevalence of flagellar motility in the rhizosphere over bulk soil at each site and then tested whether these differences were significantly different from zero using a one-sample t-test. The t-tests were implemented using different arguments of the t.test function in base R [71]. We used Pearson's correlations to evaluate relationships between the prevalence of flagellar motility and NPP and used linear regression to represent the standard curve to quantify the prevalence of flagellar motility in metagenomes.
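The per-site difference test used for the citrus dataset can be sketched as follows. This Python sketch only computes the one-sample t statistic and its degrees of freedom from paired observations; it does not compute a P value, and it is not the R code used in the study. The function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_difference_t(rhizosphere, bulk):
    """One-sample t statistic testing whether per-site differences
    (rhizosphere minus bulk) depart from zero.

    Returns (t statistic, degrees of freedom).
    """
    diffs = [r - b for r, b in zip(rhizosphere, bulk)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Toy example with made-up prevalence values (%) for three sites.
t_stat, df = paired_difference_t([11.0, 22.0, 33.0], [10.0, 20.0, 30.0])
```

Working with per-site differences removes between-site variation in baseline prevalence, which is why a one-sample test on the differences suits a design with a single rhizosphere and bulk observation per site.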
We used 16S rRNA gene sequencing information from samples in the glucose amendment experiment [70] to investigate the shifts in the taxonomic composition of soil bacterial communities upon glucose addition. Specifically, we investigated which bacterial Amplicon Sequence Variants (ASVs) responded to glucose addition using ANCOM-BC [72]. The taxonomic composition of these bacterial communities was investigated using the "phyloseq" R package (v1.38.0; [73]), and we used a Welch two-sample t-test to test the effect of glucose amendment on the prevalence of flagellar motility estimated with our metagenome-based method.

Development of a genomic model to predict flagellar motility in bacteria
Since the genes involved in flagellum assembly are well described and conserved across bacterial groups [23], we were able to use information on the presence/absence of flagellar genes to predict the capacity for flagellar motility from genomic information alone. We used genomic data for 1225 bacterial strains known to be motile or nonmotile using strain description information compiled in [26] as training data for a boosted regression machine learning model to predict the capacity for flagellar motility in bacteria (388 unique strains with known flagellar motility and 837 unique strains with no flagellar motility; Supplementary Data 1). We note that being nonflagellated does not mean taxa are nonmotile. For example, within the Bacteroidota, which had only two flagellated members in the training data, the majority of aquatic and terrestrial members display gliding motility [74,75]. Of the initial set of 35 genes we identified from the literature as being associated with flagellum assembly (Supplementary Data 2), we found that 14 out of these 35 genes were either not frequently found in the genomes of taxa with experimentally validated flagellar motility, or occurred in >50% of the genomes of nonflagellated taxa. As these 14 genes were not useful for predictive purposes, the final model was based on the presence/absence of 21 genes that were sufficiently prevalent across bacterial genomes and less frequently found in nonmotile taxa (Supplementary Data 2). These genes encode different structural parts of the flagellar apparatus, including the basal body (FlaE, FliL, Flg_bbr_C, Flg_bb_rod), the flagellar rotor (FliG_C), the flagellar hook (FlgD, Flg_hook, FliD_C, FliE), or the M-ring (YscJ_FliF_C), as well as multiple proteins for protein export and the flagellins required for flagellar assembly (Supplementary Data 2). We verified that the presence/absence of this set of genes could effectively distinguish taxa with flagellar motility from nonflagellated taxa using PCA (Supplementary Fig. 1A).
Our model correctly predicted flagellar motility for all taxa with experimentally verified flagellar motility and correctly predicted the absence of flagellar motility in 94.5% of the cases for nonflagellated taxa (Supplementary Fig. 1B). We verified that many of the genomes that the model incorrectly predicted as having flagellar motility belonged to strains whose genomes contain the majority of flagellar motility genes and have sister taxa that do display flagellar motility (Supplementary Data 3). We also recognize that a number of strains might express flagella under certain environmental conditions that would not be captured with the specific in vitro conditions used for strain isolation and phenotyping. The phylum Proteobacteria was overrepresented in the phenotypic trait database (Supplementary Fig. 2), but our predictions of flagellar motility for this phylum were not necessarily more accurate than predictions for other phyla (Supplementary Fig. 3), as it contained numerous taxa considered to be nonmotile with flagellated sister taxa (Supplementary Data 3). We recognize that our dataset is over-represented by taxa (particularly those within the Actinobacteriota, Firmicutes, and Proteobacteria) that are readily cultivated in vitro as those are the only taxa for which phenotypic information on flagellar motility is available. However, given that the genes associated with flagellar motility are generally well-conserved across a broad diversity of bacteria [23], and given that our model was robust across multiple phyla (Supplementary Fig. 3), we expect that our genome-based model can also effectively predict flagellar motility for taxa with no available phenotypic information on motility, including taxa not included in our test set.

Prevalence of flagellar motility across a broad diversity of bacteria
We used our validated genome-based model (based on the presence/absence of 21 genes) to determine how the potential for flagellar motility is distributed across a broad diversity of bacteria, including a wide range of taxa for which no phenotypic information on motility is currently available. We did so to assess the degree to which flagellar motility is predictable based on taxonomic or phylogenetic information, and to investigate the genomic attributes that are generally associated with flagellar motility. We predicted the capacity for flagellar motility for 26 192 bacterial genomes spanning 12 major phyla (covering all high-quality genomes in GTDB r207 [27], belonging to the main bacterial phyla; see Materials and Methods). The predicted prevalence of flagellar motility was highly variable among phyla, ranging from the phylum Spirochaetota, which had the highest proportion of flagellated taxa (93.2%) to the Deinococcota and Mycoplasmatota, which our model suggests do not have any flagellated members (Fig. 1A). Among the phyla with the largest number of genomes, we found that the Proteobacteria are predominantly flagellated (78.3%), with lower proportions for the Firmicutes (54.6%), and very low proportions of flagellated taxa in the phyla Actinobacteriota (15.9%) and Bacteroidota (0.7%) (Fig. 1A).
The majority of bacterial phyla contain numerous families with both flagellated and nonflagellated members (Supplementary Fig. 4). This means that family-level taxonomic information alone cannot necessarily provide robust inferences of flagellar motility, stressing the need for alternative approaches to evaluate the prevalence of this trait across microbial communities. However, at the genus level, most taxa are either flagellated or nonflagellated, indicating that this trait is typically conserved at this level of taxonomic resolution (Supplementary Fig. 5).
Consistent with our taxonomic analyses, the phylogenetic analyses also highlight that the prevalence of flagellar motility is highly variable at broad taxonomic levels, which was reflected in a weak phylogenetic signal (phylogenetic D = −0.077, P < 0.001; Fig. 1B). Higher-resolution phylogenetic and taxonomic information can often be useful for inferring flagellar motility, particularly for those groups that are well characterized (i.e. where information on flagellar motility, or lack thereof, is available for closely related taxa). However, phenotypic information is often unavailable for the broad diversity of taxa found in environmental samples, highlighting the utility of the genome-based predictive approach described here, which makes it feasible to leverage the rapidly expanding databases of bacterial genomes to comprehensively investigate the prevalence of this trait in microbial communities.

General life history strategies associated with flagellar motility in bacteria
We expect that bacteria with the capacity for flagellar motility should have distinct ecologies from nonflagellated taxa. In particular, we expect that flagellated taxa should be capable of more rapid growth and a greater capacity for carbohydrate degradation than nonflagellated taxa (i.e. a "resource-acquisition" life history strategy; [32]). By analyzing the 26 192 genomes for which we had inferred the capacity for flagellar motility, we were able to identify genomic attributes that were consistently associated with flagellar motility, conducting these analyses separately for each of the six phyla that had sufficient representation of both flagellated and nonflagellated taxa (Fig. 2A; see Materials and Methods). Besides the expected overrepresentation of genes for motility and extracellular structures (Fig. 2A), the two gene categories that were consistently overrepresented in taxa with the capacity for flagellar motility were signal transduction mechanisms (linked to chemotaxis) and carbohydrate transport and metabolism (Fig. 2A). The latter observation is consistent with our general expectation that flagellar motility should be associated with a resource-acquisition life history strategy (sensu [32]). However, this pattern was only evident in four of the six phyla examined (Fig. 2A).
Figure 2. Genomic attributes associated with bacteria inferred to have the capacity for flagellar motility. (A) Functional gene categories that are consistently overrepresented in genomes from taxa predicted to be flagellated or nonflagellated across the six most dominant phyla that contain >15% flagellated taxa. Functional categories were defined as COGs. We indicated those gene categories that were not statistically different with gray shading based on Mann-Whitney U tests (P > 0.01). (B) Prevalence of flagellar motility in genomes derived from environmental metagenomes (MAGs) or single cells (SAGs) ("assembled"), and in genomes obtained from bacterial isolates ("isolate"). Numbers on the upper and lower ends of the plot indicate the number of genomes predicted to be nonflagellated and flagellated, respectively. (C) Number of 16S rRNA gene copies in genomes of taxa predicted to be flagellated (n = 12 726) versus nonflagellated (n = 13 236). (D) Genome size of taxa predicted to be flagellated and nonflagellated. (E) Number of genes involved in chemotaxis identified in the genomes of taxa that are flagellated and nonflagellated. In (A), n Acidobacteriota = 232; n Actinobacteriota = 4935; n Firmicutes = 4623; n Planctomycetota = 387; n Proteobacteria = 11 089; n Verrucomicrobiota = 285. In (C) and (E), the P value was obtained from Mann-Whitney U tests due to nonnormality of the data. In (D), the P value was obtained from a Welch two-sample t-test (P < 0.05); n = 26 192 genomes.
We found that 51.3% of genomes obtained from cultured isolates were predicted to be flagellated, compared to only 35.2% of genomes of assembled origin (metagenome-assembled and single cell-assembled genomes, MAGs and SAGs, respectively; Fig. 2B). Because culture collections are generally biased toward faster growing bacterial taxa with adaptations for rapid substrate uptake [76], these results provide additional support for the hypothesis that flagellar motility is often indicative of a resource-acquisition life history strategy [32].
To complement the gene-based analyses of life history strategies, we also determined the total number of 16S rRNA gene copies per genome as a proxy for maximum potential growth rate in bacteria [56]. The number of 16S rRNA gene copies was significantly higher in genomes of taxa inferred to have the capacity for flagellar motility (Mann-Whitney U P < 0.001; Fig. 2C), but this pattern was only significant in two phyla (Firmicutes and Proteobacteria; Supplementary Fig. 6A). Genome size was also significantly larger in taxa predicted to display flagellar motility (Mann-Whitney U P < 0.001; Fig. 2D), a pattern that was consistent across all phyla except for the Actinobacteriota (Supplementary Fig. 6B), and agrees with previous work [29]. We additionally verified that flagellated taxa harbored a significantly higher number of genes for chemotaxis than taxa predicted to be nonflagellated (Fig. 2E), as we would expect (Fig. 2A; [4]).
Together, our genomic analyses suggest that bacteria with flagellar motility tend to be capable of more rapid growth and the rapid acquisition of organic C substrates, but this pattern is variable across phyla. Consistent with our findings, a recent global classification of life history strategies in bacteria found flagellar motility to be associated with elevated genomic capacity for carbohydrate metabolism, higher 16S rRNA gene copy numbers, and larger genomes [29]. Recent studies focusing on soil bacterial communities have had similar findings [39,77], and in aquatic environments, flagellar motility is considered a signature of copiotrophic lifestyles [30]. Overall, our findings suggest that flagellar motility is often part of a general life history strategy for rapid organic carbon metabolism and high maximum potential growth [31], recognizing that these analyses are based on a biased subset of bacterial diversity [76] given that most of the genomes included in this analysis (83%) were derived from cultivated isolates.

Application of a metagenome-based approach to quantify the prevalence of flagellar motility in bacterial communities
We extended our genome-based method so it could be used to infer the prevalence of flagellar motility in whole communities. As the prevalence of flagellar motility is difficult to reliably infer from taxonomic information alone (see above), and because neither genomic data nor phenotypic information is available for many environmental bacteria, we used a metagenome-based method to quantify flagellar motility as a community-aggregated trait [78]. This method is based on calculating the ratio between the 21 genes identified as being indicative of flagellar motility and single-copy marker genes detected per metagenome (see Materials and Methods and overview provided in Fig. 3A). We first validated this metagenomic approach using simulated metagenomic data (see Materials and Methods). The simulated data were derived by mixing different proportions of genomes from taxa with experimentally verified flagellar motility capabilities, creating a gradient of metagenomes containing between 0% and 100% flagellated taxa (Fig. 3B; Supplementary Data 4). This allowed us to obtain a linear equation to predict the prevalence of flagellated bacteria in any given metagenome based on the ratio of the summed abundances of the 21 genes indicative of flagellar motility (as determined from the genomic analyses above) to the summed abundances of single-copy genes shared across nearly all bacteria (using a similar approach to [63]; Fig. 3A). With these simulated metagenomes, the ratio between the median gene length-corrected reads per kilobase assigned to flagellar and single-copy marker genes was strongly correlated with the proportion of flagellated taxa in bacterial communities assembled in silico (Pearson's correlation r = 0.99, P < 0.0001; Fig. 3B). We further validated the approach with metagenomic data obtained by sequencing a DNA mixture from the commercial ZymoBIOMICS microbial community standard, which contains known amounts of genomic DNA from different bacterial taxa whose flagellar motility capabilities are known a priori (see Materials and Methods). We found that this method accurately inferred the proportion of taxa that were flagellated based on metagenomic information alone (estimated proportion of flagellated taxa = 52.0%, expected proportion of flagellated taxa = 48.2%; Fig. 3B). We also verified that our estimates using only the forward reads did not differ from those using the reverse or merged reads (Fig. 3B). Together, these results highlight that we can accurately infer the community-level prevalence of bacterial flagellar motility in any metagenome of interest simply by calculating the ratio between the sum of the 21 flagellar genes and the sum of single-copy bacterial marker genes.
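The index calculation described in this section can be illustrated with a minimal Python sketch. It computes a gene length-corrected reads-per-kilobase (RPK) value for each gene, takes the ratio of the median RPK of the flagellar genes to the median RPK of the single-copy marker genes, and converts that ratio to a percentage with a linear calibration. The function names, gene names, read counts, gene lengths, slope, and intercept below are all hypothetical placeholders; the actual calibration comes from the simulated metagenomes (Fig. 3B) and is not reproduced here. The sketch uses medians, which is one of the two aggregations mentioned in the text.

```python
from statistics import median


def rpk(read_count, gene_length_bp):
    """Reads per kilobase: read count divided by gene length in kilobases."""
    return read_count / (gene_length_bp / 1000.0)


def motility_index(flagellar_hits, marker_hits):
    """Ratio of the median RPK of flagellar genes to that of marker genes.

    Each argument maps a gene name to a (read_count, gene_length_bp) tuple.
    """
    flagellar_rpk = [rpk(count, length) for count, length in flagellar_hits.values()]
    marker_rpk = [rpk(count, length) for count, length in marker_hits.values()]
    return median(flagellar_rpk) / median(marker_rpk)


def predicted_prevalence(index, slope, intercept):
    """Apply a linear calibration and clamp the result to the 0-100% range."""
    return max(0.0, min(100.0, slope * index + intercept))


# Hypothetical toy data (three genes per set instead of 21 and 120).
flagellar = {"fliC": (30, 1500), "flgE": (20, 1000), "motA": (10, 2000)}
markers = {"rpoB": (200, 4000), "recA": (50, 1000), "gyrB": (80, 2000)}

idx = motility_index(flagellar, markers)
# The slope and intercept are placeholders, not the fitted values.
prevalence = predicted_prevalence(idx, slope=100.0, intercept=0.0)
print(f"index = {idx:.2f}, predicted prevalence = {prevalence:.1f}%")
```

Normalizing by single-copy marker genes makes the index comparable across metagenomes of different sequencing depth, because both the numerator and the denominator scale with the number of bacterial reads.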

Prevalence of bacterial flagellar motility across gradients in soil carbon availability
We used our metagenome-based approach to further test our hypothesis that flagellar motility is most likely to be associated with taxa adapted for fast resource acquisition under resource-rich conditions (Fig. 2A-C). If this hypothesis is valid, we would expect the community-wide prevalence of bacterial flagellar motility to be higher in soils with greater amounts of available organic C. Since it is challenging to directly quantify the amount of C in soil that is available to fuel microbial activities, we selected four independent metagenomic datasets that we expect to effectively capture gradients in soil C availability, and the results from the analyses of these datasets are described below.
Soil C availability is expected to decrease with soil depth [79,80]. Across the nine soil depth profiles analyzed [64], we consistently observed a higher prevalence of flagellar motility in the surface (top 20 cm) compared to deeper soil horizons (20-90 cm, linear mixed-effects model, Estimate Surface = 11.88 ± 1.30 (mean ± SD), Estimate Subsurface = 8.64 ± 1.34, P = 0.005, n = 66; Fig. 4A; Supplementary Fig. 7). We recognize that soil C availability is not the only factor that is likely to change appreciably with soil depth. For example, soil water and nutrient availability can also vary with depth [64], and higher bulk densities can compromise motility in deeper soil horizons [81]. Thus, we cannot conclude that soil C availability is the only factor responsible for the elevated prevalence of flagellar motility in surface soil communities.
To further test our hypothesis, we analyzed 38 surface soils collected from across Australia. For this sample set, we assume that NPP is a reasonable proxy for soil C availability, as higher NPP leads to increased plant-derived organic matter inputs to soil [66]. We found that across these varied soils, the prevalence of flagellar motility in bacterial communities was strongly correlated with NPP (Pearson's r = 0.619, P < 0.001; Fig. 4B). As with the soil depth analyses, these findings also support our hypothesis that flagellar motility is more prevalent in soils with higher C availability. However, as other factors likely covary with NPP (including mean annual precipitation), these findings on their own are not sufficient to confidently support a general association between flagellar motility and soil C availability.
As both the soil depth and the Australian surface soil datasets indicate an association between inferred soil C availability and flagellar motility, we then sought to determine the prevalence of flagellar motility in bacteria from rhizospheres and associated bulk soils. Although many factors differ between rhizosphere and bulk soils, we would expect that soil C availability is one of the more prominent factors differing between these two soil habitats. The rhizosphere receives abundant inputs of available plant-derived C via root exudation [82], and rhizosphere soils generally support higher microbial respiration rates than adjacent bulk soils [38,83]. We analyzed two independent metagenomic datasets that compared bacterial communities in rhizospheres and adjacent bulk soils. One dataset contained paired rhizosphere and bulk soil samples across the globe from diverse citrus species ([67], n = 20), and the other dataset contained samples from a controlled pot experiment with wheat plants ([69], n = 24). We found that in both datasets, rhizosphere bacterial communities consistently had a higher prevalence of flagellar motility compared to their adjacent bulk soils (Fig. 4C and D). Across citrus species, the prevalence of flagellar motility was on average 11.5% higher in rhizospheres than in bulk soils (one-sample t-test P = 0.012; Fig. 4C; Supplementary Fig. 8), and was higher in the rhizosphere than in the paired bulk soil in 9 out of the 10 sites analyzed. In wheat plants, we also found that rhizosphere bacterial communities contained a higher prevalence of flagellar motility (Estimate Rhizosphere = 23.7 ± 2.6) than bulk soils (Estimate Bulk soil = 11.3 ± 8.0; Welch two-sample t-test P = 0.0002; Fig. 4D). Although we recognize that other factors could contribute to the elevated prevalence of flagellar motility in rhizosphere communities, such as plant signaling compounds involved in microbial recruitment [84], these results provide further support for our hypothesis that flagellar motility is favored under conditions of higher soil C availability, as also indicated by the analyses of the soil depth and the Australian surface soil datasets.
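The paired comparison used for the citrus dataset (per-site differences between rhizosphere and bulk soil, tested against zero with a one-sample t-test) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and the prevalence values are hypothetical and are not data from [67].

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(rhizosphere, bulk):
    """One-sample t statistic for per-site differences tested against zero.

    Both lists must be ordered by site, with one value per site.
    Returns the t statistic, the degrees of freedom, and the differences.
    """
    if len(rhizosphere) != len(bulk):
        raise ValueError("each site needs one rhizosphere and one bulk value")
    diffs = [r - b for r, b in zip(rhizosphere, bulk)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # stdev uses the n - 1 denominator
    return mean(diffs) / standard_error, n - 1, diffs


# Hypothetical prevalence values (%) for four sites.
rhizo = [20.0, 18.0, 25.0, 22.0]
bulk_soil = [15.0, 16.0, 19.0, 20.0]

t_stat, dof, diffs = paired_t_statistic(rhizo, bulk_soil)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The P value then follows from the t distribution with n - 1 degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_1samp`) can compute it directly from the differences.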

Experimental verification that bacterial flagellar motility is associated with soil carbon availability
To more conclusively test whether soil C availability is associated with the prevalence of bacterial flagellar motility, we generated new metagenomic data from a 117-day soil incubation experiment where C availability was directly manipulated via regular glucose amendments (see Materials and Methods and Table S1; [70]). This experiment was performed in the absence of a growing plant and under uniform moisture and soil physicochemical conditions, thus minimizing the impact of these potential confounding factors. The prevalence of flagellar motility in the bacterial communities amended with glucose (15.82 ± 0.89%) was higher than in the soils that did not receive glucose (13.19 ± 1.56%) (Welch two-sample t-test P = 0.017; Fig. 5A). This pattern is in line with the results from the field studies (Fig. 4) and supports our central hypothesis that the prevalence of flagellar motility is positively associated with soil C availability. The rather small size of these effects is likely due to the fact that relatively few bacterial taxa responded to the glucose addition. Although glucose addition shifted the overall community composition (Fig. 5B), only 28 bacterial taxa (ASVs) out of the total 1203 ASVs detected were significantly more abundant in the glucose-amended soils. These taxa that significantly responded to glucose addition belonged to seven different bacterial phyla (Supplementary Fig. 9).
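As a worked illustration of the Welch two-sample comparison reported above, the following Python sketch computes the t statistic and the Welch-Satterthwaite degrees of freedom from summary statistics. It assumes that the reported ± values are standard deviations and that the group sizes are those given in the Fig. 5 caption (n = 4 amended, n = 5 not amended); both are assumptions made for this example, and the function name is hypothetical.

```python
from math import sqrt


def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom.

    Inputs are group means, standard deviations, and sample sizes.
    """
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof


# Glucose-amended versus non-amended soils, assuming +/- denotes SD.
t_stat, dof = welch_from_summary(15.82, 0.89, 4, 13.19, 1.56, 5)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

Under these assumptions the sketch gives a t statistic of about 3.18 with about 6.5 degrees of freedom. With the raw per-sample values, the same test can be run with `scipy.stats.ttest_ind(a, b, equal_var=False)`.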

Conclusions
We have shown that flagellar motility is a key trait linking C dynamics and microbial communities in soil. Consistent with expectations [31], our genomic analyses reveal that flagellated taxa tend to be associated with a resource-acquisition life history strategy. This observation was supported by our metagenomic analyses that revealed a positive relationship between the prevalence of flagellar motility in bacterial communities and soil C availability across multiple, independent datasets. This relationship between flagellar motility and soil C availability can be explained based on fundamental energetic constraints, which make flagellar motility a beneficial trait in environments where C availability is elevated, particularly in spatially structured environments like soil where available C can be patchily distributed [36].
The methods to predict microbial traits from genomic information presented here are particularly relevant for traits that are difficult to quantify in situ or for those that require isolation and culturing [85,86]. Our metagenome-based approach to infer the proportion of a microbial community harboring any given phenotypic trait would be very useful for this purpose (Fig. 3A). This method can also be applied to investigate processes where flagellar motility is expected to play an important role, such as microbial colonization and persistence in host-associated microbiomes [87,88]. In efforts to improve microbiome management, a better quantification of the prevalence of flagellar motility in these systems could help identify microbiomes that are likely to be more persistent in the host or more likely to deliver beneficial functions [89]. These methods could also be used to explore the prevalence of motility and its associated traits across gradients in C availability in other environments of interest, such as freshwater systems. Overall, genome-based predictive approaches offer opportunities for expanding our trait-based understanding of microbial communities beyond cultivated taxa, and help us understand microbial community patterns across environmental gradients.

Figure 1. Taxonomic and phylogenetic distribution of flagellar motility in bacteria. (A) Prevalence of flagellar motility in bacterial taxa from the 12 best-represented phyla in a curated database of reference genomes (n = 26 192 genomes as "species clusters" from the Genome Taxonomy Database, GTDB r207). (B) Phylogenetic distribution of flagellar motility across the 12 bacterial phyla. To construct the tree, we randomly selected a single genome representative of each family found in each phylum, and predicted the capacity for flagellar motility in these genomes. Higher bars indicate a greater proportion of genomes within that family that are inferred to have the capacity for flagellar motility (based on our genome-based model, see Materials and Methods). The scale bar in (B) depicts branch length as substitutions per site. The tree was constructed from the GTDB r207 phylogeny [27].

Figure 3. Developing a metagenome-based approach to quantify the prevalence of flagellar motility in bacterial communities. (A) Method overview. We first collected whole-genome data for bacterial taxa directly observed to have flagellar motility in vitro (i). We then made combinations of genomes with and without the capacity for flagellar motility to create a gradient of the prevalence of flagellar motility in mock metagenomes (ii). These mock metagenomes were created by simulating shotgun metagenomic reads from the whole genomes (see Materials and Methods). We annotated the metagenomes to identify the 21 flagellar genes determined from the genomic analyses to be indicative of flagellar motility along with a set of 120 single-copy marker genes that are found in nearly all bacteria (see Materials and Methods) (iii). Finally, we calculated the gene length-corrected RPK of each of these gene sets and calculated a flagellar motility index using the ratio between these indices (iv). (B) Linear relationship between the flagellar motility index calculated as shown in (A) (4) and the proportion of genomes of taxa with flagellar motility in simulated metagenomes [(A), 2] (n = 14). The y-axis shows a gradient of bacterial metagenomes created by combining different proportions of genomes from bacteria known to be flagellated or nonflagellated spanning the phyla Actinobacteriota, Firmicutes, and Proteobacteria. The linear equation resulting from this association can be used to quantify the prevalence of flagellar motility in any bacterial metagenome. Colored dots indicate the known proportion of flagellated taxa in the metagenome of the ZymoBIOMICS microbial community standard.

Figure 4. Prevalence of flagellar motility in bacterial communities spanning putative gradients in soil C availability. (A) Estimated prevalence of flagellar motility in bacterial communities found across soil profiles (surface, 0-20 cm; subsurface, 20-90 cm; n = 66). These soil profiles were sampled from sites that covered diverse climatic regions across the USA [64]. Group means are shown as white diamonds, and the P value was obtained from a linear model with site coded as a random factor. (B) Relationship between the estimated prevalence of flagellar motility and NPP across Australia (n = 38; [65]). The shaded area depicts the standard error around the mean. (C) Comparison of the prevalence of flagellar motility in bulk soils and rhizospheres of citrus trees found at 10 sites across the globe (n = 20; [67]). Since each site contained a single bulk soil and a single rhizosphere sample, we indicate which samples come from the same site using connecting lines. To obtain the P value, we calculated the difference between the prevalence of flagellar motility in the rhizosphere and bulk soil at each site, and then made a comparison against zero using a one-sample t-test. Means are shown as horizontal red lines. (D) Comparison of the prevalence of flagellar motility in bulk soils and rhizospheres of wheat plants from a controlled pot experiment (n = 24; [69]). The P value was obtained using a Welch two-sample t-test. Statistical significance is set at P < 0.05.

Figure 5. Prevalence of flagellar motility in bacterial communities from a 117-d soil incubation experiment where soil C availability was directly manipulated via addition of glucose. (A) Estimated prevalence of flagellar motility in bacterial communities found in soils amended (n = 4) and not amended (n = 5) with glucose as a way to directly manipulate soil C availability [70]. The P value was obtained using a Welch two-sample t-test with significance at P < 0.05. (B) Taxonomic composition of bacterial communities from soils amended or nonamended with glucose over 117 days of incubation. The taxonomic composition of the bacterial communities was determined via amplicon sequencing of the 16S rRNA gene (see Materials and Methods).