Signatures of kin selection in a natural population of the bacteria Bacillus subtilis

Abstract Laboratory experiments have suggested that bacteria perform a range of cooperative behaviors, which are favored because they are directed toward relatives (kin selection). However, there is a lack of evidence for cooperation and kin selection in natural bacterial populations. Molecular population genetics offers a promising method to study natural populations because the theory predicts that kin selection will lead to relaxed selection, which will result in increased polymorphism and divergence at cooperative genes. Examining a natural population of Bacillus subtilis, we found consistent evidence that putatively cooperative traits have higher polymorphism and greater divergence than putatively private traits expressed at the same rate. In addition, we were able to eliminate alternative explanations for these patterns and found more deleterious mutations in genes controlling putatively cooperative traits. Overall, our results suggest that cooperation is favored by kin selection, with an average relatedness of r = .79 between interacting individuals.

To determine if codon usage is more even in social genes or highly expressed genes, we calculate Shannon entropy for each amino acid.If each codon is used evenly, then Shannon entropy will be 0. If one codon is used more frequently than random, then Shannon entropy will be negative.This is analogous to how entropy is used in ecology to calculate species richness.An amino acid mostly being coded for with the same codon is equivalent to one species dominating in a species richness measure.On average, Shannon entropy is much less negative for cooperative genes (-0.215 compared to -0.869 for highly expressed genes), suggesting that cooperative genes have much more even usage of codons, suggesting that there may be relaxed selection for synonymous codon usage in cooperative genes (Supplementary Figure S1.2) Supplementary Figure S1.2:Shannon entropy for cooperative (blue) and highly-expressed (yellow) genes for each amino acid.Cooperative genes also have higher GC3 (GC content at third positions) than our set of highly expressed genes.[0.407 in cooperative compared to 0.322 in ribosomal] which may cause them to exhibit variable expression.At first glace, this finding doesn't fit with our hypothesis that cooperative genes are under kin (relaxed) selection, because high GC content is often indicative of highly expressed genes under strong selection (1).However, preferred codons in B. subtilis are AT biased (7).Our finding that cooperative genes have high GC content is therefore consistent with relaxed selection because cooperative are less likely to use preferred codons.The effective number of codons is also substantially higher in cooperative genes than in our set of highly expressed genes (55.1 compared to 43.4), further demonstrating how cooperative genes are using more codons, and therefore less likely to use preferred codons.We also conducted chi-squared tests to determine for each codon whether it was used significantly more than expected (from random chance) in cooperative genes than highly expressed genes, and found that only 14 codons are used significantly more often than expected by chance in cooperative genes, compared to 22 codons that are used significantly more often than expected for highly-expressed genes.
Overall, there is some evidence that decreased usage of preferred codons may explain the increase in synonymous polymorphism.This is the third study to find the counter-intuitive pattern of increased synonymous polymorphism in social genes (P. aeruginosa (8) ; D. discoideum (9), so more research is needed in this area.

S2: Cooperative vs. background genes
We compared cooperative genes to private genes in the main analysis, and here we also compare to a set of background genes.For the background genes we use those which produce proteins that localise to the cytoplasm, as these are least likely to have cooperative functions.This produces a set of 1832 genes.Here, we present the results of the post-hoc comparison between cooperative genes and background genes for the main set of molecular 1.000We find the same pattern as in the main analysis, with cooperative genes having a signature of higher non-synonymous polymorphism and divergence, without evidence for increased likelihood of positive or balancing selection.This comparison is less well-controlled than the main analysis, but these results add further weight to our conclusion that signature of selection we observed in quorum sensingcontrolled genes is a signature of kin selection.

S3: Balancing selection
A logical explanation for cooperative genes to be more polymorphic than private genes would be that cooperative genes are more likely to be under balancing selection.This could occur if both cheats and cooperators are maintained in a population because the fitness advantage of cheats declines as they become more common (10,11).It could also occur if multiple greenbeard recognition alleles are maintained (12).To detect balancing selection from sequence alignment data, we can use several population genetic measures such as Tajima's D and Fu & Li's F* and D* that look at allele frequencies to determine if balancing selection is occurring.Tajima's D looks at the distribution of allele frequencies, whereas Fu & Li's measures look at singletons (rare variants found in only one strain).To interpret these measures, we use statistical tests to determine if each gene is significantly different from the neutral expectation.We then extract the list of genes with significant support, and test if they are overrepresented for social genes using binomial tests.Tajima's D We use the beta-distribution test with alpha=0.025from the Pegas package in R (13,14) to identify which genes have evidence for balancing selection.The test could be significant either due to balancing selection and a lack of rare alleles (D>>0) or a recent selective sweep (D<<0), so we exclude significant genes where D<0.We have 242 genes with significant evidence for balancing selection.Only one of those is a cooperative gene, so cooperative genes are not significant overrepresented in those under balancing selection (binomial test, p=0.380).That gene is the aminoglycoside resistance gene aadK.

Fu & Li's D* and F*
We use the critical values from (15) for n=100 genes and alpha=0.025,which is 1.53 for D* and 1.73 for F*.The probability of D* or F* being greater than this critical value for by chance is 0.025.Although we have >100 genes, this is likely to be a good approximation as the critical value scales with the natural-log of n.We have 21 genes with significant evidence for balancing selection from Fu & Li's D*.None of them are cooperative.Overall, these results support our conclusions that cooperative genes have higher polymorphism due to relaxed selection, rather than balancing selection.

S4: Positive selection
Cooperative genes might show greater divergence than private genes due to being more likely to be under positive or directional selection.We use several complementary measures to look for signatures of positive selection in our genes.First, we use the Mcdonald-Kreitman (MK) test, which compares the ratio of nonsynonymous to synonymous divergence with the ratio of non-synonymous to synonymous polymorphism (16).If there is lots more non-synonymous divergence than non-synonymous polymorphism, then this could indicate positive selection.The logic is that with strong positive selection, advantageous mutations don't spend much time as polymorphisms, and are mainly detected through divergence.We have 15 genes which have evidence for significant positive selection from the MK test.None of them are cooperative.Secondly, we use the neutrality index, which uses the same information as the MK test, but rather than just a binary significance test, it uses the full information to make a continuous variable ( 16).This can be interpreted by comparing averages and distributions of different groups of genes.We log-transformed neutrality index to normalise, meaning that positive values indicate positive selection.There is no difference in neutrality index between cooperative and private genes, and both also don't differ from the background set of genes (Kruskal-Wallis test, chi-squared=0.06,df=2, p=0.974)Supplementary Figure 4.1.

Supplementary Figure 4.1:
Log-transformed neutrality index of private and cooperative genes.The dashed line shows the median for background genes.We also use the direction of selection statistic, which again uses the same information as the MK test to make a continuous variable, with positive values indicating positive selection.Whilst there is a slight trend for cooperative genes to have a higher direction of selection statistic than private genes, this is not significant (Kruskal-Wallis test, chi-squared=0.09,df=2, p=0.955)Supplementary Figure 4.2.Supplementary Figure 4.2: Direction of selection statistic for private and cooperative genes.The dashed line shows the median for background genes.Overall, these findings support our conclusion that cooperative genes are not more divergent than private genes due to being more likely to be under positive selection.

Mcdonald-Kreitman conservative
The normal Mcdonald-Kreitman test uses all synonymous differences as the basis of the neutral expectation.Here, we devised and used a modified form of the test named Mcdonal Kreitmanconservative, which considers only the subset of synonymous differences that are most likely to be neutral.We use the logic that synonymous mutations are most likely to be under selection if they switch between preferred codons and unpreferred codons, as codon usage is known to be important in the function and expression of a gene (5).We therefore define a 'conservative' difference as once that either connects two preferred codons, or connects two un-preferred codons.This is inspired by a similar modification to Dn/Ds analysis devised by Zhou et al (17).We used data on preferred codons in B. subtilis from Shields and Sharp (1987), which is stored in the R package seqinR (18).We used the R package VariantAnnotation (19) to extract the codon used in each mutation, and a custom script to categorise synonymous mutations as either conservative or not-conservative.We find that nine of the QS-controlled genes have evidence of significant positive selection, none of which are cooperative.We also used our method to calculate 'conservative' measures of the Neutrality Index and Direction of Selection statistic, which use information from the Mcdonald Kreitman test.We find no difference between cooperative and private genes for either of these measures (Neutrality Index Wilcoxon rank sum test W=518, p=0.570Supplementary Figure S4.

Gene length
It is well-known that gene length can affect molecular population genetic parameters such as polymorphism.Even though we calculate all measures per site (considering gene length), we also check here that our results aren't an artefact of any differences in gene length between cooperative and private genes.Cooperative genes are on average 19% longer than private genes (965 base-pairs compared to 812).We conduct a small analysis where we remove the smallest 25% of genes from our analysis, which is those genes which are <450 base-pairs long.Cooperative genes are still significantly more polymorphic and divergent than private genes using this reduced dataset (Kruskal-Wallis test, chi-squared=8.16,df=2, p=0.017.Dunn Test p=0.01)

Horizontal gene transfer / pangenome
We focused our analysis on chromosomal genes, because frequent horizontal gene transfer can make drawing conclusion from molecular population genetic parameters more challenging.We check if cooperative genes are more likely to be horizontally transferred by checking if they are overrepresented in the accessory genome compared to the core genome.B. subtilis has an open pangenome, meaning that each additional strain sequenced adds many new genes.This is usually indicative of a species living in multiple or variable environments, which we know is true for this species.B. subtilis is also naturally competent, meaning they can take-up DNA from the environment, and it is thought that the open pangenome occurs because rare genes are acquired in this way from closely related species (20).We use the panX database (pangenome.org) to assign genes as core or accessory genome based on 80 B. subtilis genomes.If we count core genes as those present in 90% of genomes, then 23.2% of the genes in our reference strain are in the B. subtilis accessory genome, and 76.8% are in the core genome.10 out of 53 cooperative genes are in the accessory genome, which is 18.9%.Cooperative genes are therefore not overrepresented in the accessory genome (binomal test p=0.622).We conclude that cooperative genes are not more likely to be transferred horizontally.This makes sense in light of recent work, which has shown that cooperative genes are not more likely to be carried on plasmids compared to chromosomes across bacteria, including in B. subtilis (21).Theory also tells us that cooperation is not favoured by horizontal gene transfer, and so we don't expect cooperative genes to be overrepresented on plasmids or other mobile genetic elements (22).

Power analysis
We conducted a power analysis to see what would happen if we had fewer strains in our population genetic analysis.We took advantage of the vcf-tools software, which allowed us to randomly remove strains from our analysis.This enabled us to conduct a basic analysis on polymorphism for groups of N strains (N = 4,6,8,10,12,14,16,18,20,22,24,26,28), with 22 iterations of each number of strains Within any analysis, we focussed on the 3570 genes which are present in all strains, and calculated mean and median polymorphism (average pairwise polymorphism, relative to gene length) (Supplementary Figure S5.1).

Supplementary Figure S5.1: (A) Median polymorphism, (B) Mean polymorphism as the number of strains uses in the analysis varies. The line is a loess regression fit
The graph on the left shows median polymorphism, which shows a threshold effect once we get to 16 strains, and also much smaller error bars.This shows that a smaller number of strains will miss a lot of the diversity between strains.The graph on the right shows mean polymorphism.The main pattern is that the standard error in mean polymorphism declines substantially as the number of strains increase.This means that the likelihood of getting a good estimate of the population mean increases as the number of strain increases, which makes logical sense.The fact that mean polymorphism doesn't change much, but median does, implies that the distribution has changed.The overall conclusion is that the number of strains we have included is likely appropriate to capture the true variation in the population

S6: Division of labour
We conducted an analysis to see if the division of labour in the production of public goods in B. subtilis could explain the signatures of selection that we observe.For this, we look at the subset of 40 genes which are controlled by Spo0A, which is active only in a subset of cells (23).Depending on whether Spo0A is on or off, the expression of extracellular polysaccharides and amyloid protein fibres is either repressed or not (24,25).Spo0A controlled genes have lower median polymorphism than other quorum sensingcontrolled genes, whether measured as overall polymorphism (0.0045 vs. 0.0101), nonsynonymous polymorphism (0.0015 vs. 0.0037), or synonymous polymorphism (0.016 vs. 0.021).Spo0A controlled genes also have lower non-synonymous divergence (0.013 vs. 0.023) and lower ratio between non-synonymous and synonymous divergence (0.049 vs. 0.097), although they have slightly higher synonymous divergence (0.286 vs. 0.266).The direction of selection statistic is very similar between the two groups (-0.044 vs. -0.037).This pattern is the opposite to what we would expect if the division of labour was responsible for the signature of selection, rather than sociality per se.If the lower conditional expression was having a large effect, then we would expect these genes to have higher polymorphism than the background set of quorum sensing-controlled genes.Further, EPS and TasA genes stand out within this class of genes as having high polymorphism, implying that the social effect is important in causing the signature of selection that we observe.

S7: Competence
The reference strain NCIB 3610 has a plasmid-encoded gene ComI, which interferes with the competence machinery (Konkol 2013).We know that there is variation in natural competence in our strains, as they were all screened for competency by (26).18 out of the 31 strains we used are genetically competent.Some of the competence genes (comGF, comGE, comGG) have extremely high polymorphism.
Here, we conduct a small analysis where we restrict our analysis to only competent strains, to see if this polymorphism is caused by disuse in non-competent strains.We find that the competence operon comG is still amongst the most polymorphic even when we are only looking at competent strains.Furthermore, cooperative genes are still significantly more polymorphic than private genes in the competent strains (Kruskal-Wallis test, chi-squared=14.7,df=2, p<0.0001).Overall, we conclude that variation in natural competence isn't responsible for the difference between cooperative and private genes that we observe.

S8: Estimation of relatedness
According to theory, the degree to which selection is relaxed in cooperative genes relative to private genes is inversely proportional to relatedness.This result emerges from a simple population genetics model, which shows that a slightly deleterious allele with a cooperative effect on fitness will reach equilibrium frequency inversely proportional to the relationship between the actor and recipient (r) (27).If we assume weak selection and a large population (and ignore higher-order terms), we can directly map this prediction to relative levels of nucleotide polymorphism between alleles with cooperative and private effects on fitness.As we noted in our previous work on P. aeruginosa, we have to make further assumptions that our set of cooperative genes experience the same average strength of selection and distribution of fitness effects as private genes, but this approach has the advantage of many experimental attempts to estimate relatedness for social interactions in that we don't need to know the scale at which interactions take place, or the relative weighting of environments in which traits are more or less social.Cooperative genes have a median polymorphism of 0.0122, and private genes have median polymorphism of 0.0096.This leads to a calculation of relatedness as r=0.79.We note that this measure might vary depending on how you define the population, which is tricky in bacteria due to horizontal gene transfer and other complications (28).In B. subtilis, for example, we know that strains from the same plant root or gram of soil can vary in their production of public goods, and don't always group together phylogenies (29)(30)(31).

S9: Correlation in gene expression for secondary comparisons
Here, we use the dataset from Futo et al. 2020 to test if the genes we used in the secondary comparisons also tend to be co-expressed.For each set of genes we first calculate the mean pairwise correlation between all possible pairs of genes.We then use a bootstrap approach of randomly sampling 10,000 other gene sets of the same size, to see if the gene set has higher correlated expression than expected by chance.For all gene sets, we find that the correlation is greater than >95% of randomly sampled gene-sets (Supplementary Table S9.1)Supplementary Table S9  The difference between (4) and ( 5) is conditionality.This is a comparison between cooperative extracellular matrix genes known to be expressed only in a subset of cells (epsA-O & tapA-sipW-tasA) to other cooperative QS genes.The level of conditionality had no effect (t-test t=0.0438, df=13, p=0.966) (Figure S15.2).

Figure S15.2:
Nucleotide polymorphism for two groups of genes that differ in levels of conditionality from low (left) to high (right).Each point is a gene, with the bar representing the mean for that group.

Tests of the effect of sociality
The difference between (3) and ( 4) is sociality.This is a comparison between Spo0A-controlled QS genes (extra-conditional private genes) with extracellular matrix genes (extra-conditional cooperative genes).The cooperative genes had a significantly higher polymorphism than private genes (t-test t=-4.150,df=36, p<0.001) (Figure S15.3).

Figure S15.3:
Nucleotide polymorphism for two groups of genes that differ in levels of sociality from private (left) to cooperative (right).Each point is a gene, with the bar representing the mean for that group.The black bar and * represents the significance of pairwise comparisons using a t-test.

A combined model
To test the effects of sociality and conditionality, we also made a linear model that combined the data from these groups.The overall model was significant (ANOVA F3,400=15.45,p<10 -8 ).We found that higher conditionality and higher sociality both led to significantly higher polymorphism, as predicted by theory (conditionality p<10 -6 ; sociality p<10 -4 ).However, sociality explained almost twice the amount of the variation in polymorphism compared to conditionality (6.76% vs 3.83%).The jump from non-conditional to QS-controlled seems to be important, but the extra conditionality of some QS-controlled genes doesn't have an effect on polymorphism.This gives us confidence that we are able to separate the two effects.

Supplementary Figure 4 . 4 :
3; Direction of selection Wilcoxon rank sum test W=1068, p=0.563Supplementary Figure S4.4).Supplementary Figure 4.3: Log-transformed neutrality index of private and cooperative genes.The dashed line shows the median for background genes.Direction of selection statistic for private and cooperative genes.The dashed line shows the median for background genes.

Figure S15. 1 :
Figure S15.1:Nucleotide polymorphism for three groups of genes that differ in levels of conditionality from low (left) to high (right).Each point is a gene, with the bar representing the mean for that group.The black bar and * represents the significance of pairwise comparisons from a Tukey's HSD post-hoc test.The difference between (4) and (5) is conditionality.This is a comparison between cooperative extracellular matrix genes known to be expressed only in a subset of cells (epsA-O & tapA- .1: Results from pairwise correlation in gene expression of a gene set, and bootstrap iterations of randomly sampled gene sets of the same size

Table 7 :
Cooperative and private genes for the production of antimicrobials