Estimating between-country migration in pneumococcal populations

Abstract

Streptococcus pneumoniae (the pneumococcus) is a globally distributed, human obligate, opportunistic bacterial pathogen which, although often carried commensally, is also a significant cause of invasive disease. Apart from multi-drug resistant and virulent clones, the rate and direction of pneumococcal dissemination between countries remain largely unknown. The ability of the pneumococcus to gain a foothold in a country depends on the existing population configuration, the extent of vaccine implementation, and human mobility, since it is a human obligate bacterium. To shed light on its international movement, we used extensive genome data from the Global Pneumococcal Sequencing project and estimated migration parameters between multiple countries in Africa. Allele frequencies of polymorphisms at housekeeping-like loci for multiple lineages circulating in South Africa, Malawi, Kenya, and The Gambia were used to calculate the fixation index (Fst) between countries. We then used these summaries to fit migration coalescent models with the likelihood-free inference algorithms available in the ELFI software package. Synthetic data were additionally used to validate the inference approach. Our results demonstrate country-pair-specific migration patterns and heterogeneity in the extent of migration between lineages. Our approach demonstrates that coalescent models can be used effectively to infer migration rates for bacterial species and lineages, provided that sufficiently granular population genomic surveillance data are available. Further, it can reveal the connectivity of respiratory disease agents between countries to inform intervention policy in the longer term.
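
The likelihood-free fitting step described above can be illustrated with a toy rejection sampler. This is a minimal sketch under stated assumptions, not the authors' coalescent simulator and not ELFI's API: the simulator `simulate_fst` is hypothetical and swaps in the textbook island-model equilibrium approximation Fst ≈ 1 / (1 + 2·Ne·m) for a haploid population, with Gaussian noise standing in for sampling error. The effective size, prior range, and tolerance are illustrative values only.

```python
import random
import statistics


def simulate_fst(m, n_e=1000, noise_sd=0.01, rng=random):
    """Hypothetical toy simulator (an assumption, not the paper's model).

    Returns the island-model equilibrium Fst for a haploid population,
    1 / (1 + 2 * Ne * m), plus Gaussian noise.
    """
    return 1.0 / (1.0 + 2.0 * n_e * m) + rng.gauss(0.0, noise_sd)


def abc_rejection(observed_fst, n_draws=20000, prior_max=0.01, tol=0.02, seed=1):
    """Rejection sampling: keep prior draws whose simulated Fst is close to the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        m = rng.uniform(0.0, prior_max)  # uniform prior on the migration rate
        if abs(simulate_fst(m, rng=rng) - observed_fst) < tol:
            accepted.append(m)
    return accepted


# An observed Fst of 1/3 corresponds to m = 0.001 under the toy model with Ne = 1000.
posterior = abc_rejection(observed_fst=1.0 / 3.0)
posterior_median = statistics.median(posterior)
```

The accepted draws approximate the posterior of the migration rate given the summary statistic. More efficient likelihood-free samplers follow the same simulate-and-compare logic.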


Figure S12 Hudson's Fst across all genomes between each of the four demes for each GPSC, A) calculated from 81 genes, B) calculated from 355 genes, and C) including only the PBP genes (which are likely under selection in each location because of their role in the acquisition of penicillin resistance). A higher Fst indicates a more divergent, separate population, while a lower Fst indicates a more highly mixing (panmictic) population.
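
As an illustration of the statistic in this figure, the sketch below computes Hudson's Fst between two demes from per-site allele frequencies, using the commonly cited ratio-of-averages form of the estimator. The function name and inputs are assumptions for illustration, and the exact implementation used for the figure is not specified here.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's Fst as a ratio of averages across biallelic sites.

    p1, p2: per-site frequencies of one allele in populations 1 and 2.
    n1, n2: number of sampled genomes in each population.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for finite sample size.
        numerator += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that two alleles drawn from different populations differ.
        denominator += a * (1 - b) + b * (1 - a)
    return numerator / denominator if denominator > 0 else float("nan")


# Fixed differences at every site give complete divergence.
fst_fixed = hudson_fst([1.0, 0.0], [0.0, 1.0], n1=10, n2=10)
# Identical frequencies give a value at or slightly below zero.
fst_same = hudson_fst([0.5], [0.5], n1=11, n2=11)
```

Summing the numerator and denominator across sites before dividing, rather than averaging per-site ratios, keeps low-frequency sites from dominating the estimate.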

Figure S14 The posterior parameter distributions across 3 independent runs of 4 chains each across the 6 dominant GPSCs (columns) and 12 parameter estimates (rows).
Relative migration for each deme pair within each GPSC independently. The x-axis indicates the deme, and demes are grouped by GPSC. The origin location of South Africa is represented in pink, Malawi in yellow, Kenya in green, and The Gambia in purple.

Supplementary Tables
Table S1 Parameter estimates across all pairs within the two deme model. Values within the square brackets denote the 95% credible intervals. The 'Parameter' is the raw migration parameter estimate, while the 'Relative Parameter' is relative to all other deme pairs within each GPSC. The 'Directional Migration Probability' is the probability of asymmetric migration for each GPSC and each deme pair (i.e., for sa − mal, GPSC10, there is a 0.667 probability of migration, while for mal − sa there is a (1 − 0.667) probability of migration).
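
The derived columns in this table can be illustrated with a short sketch. Both definitions below are assumptions made for illustration, since the table caption does not give the formulas: the relative parameter is taken to be each estimate divided by the sum over all deme pairs within a GPSC, and the directional probability is taken to be the share of a pair's total migration that flows in one direction. The input values are hypothetical, not results from the table.

```python
def relative_parameters(estimates):
    """Scale raw estimates so they sum to one within a GPSC (assumed definition)."""
    total = sum(estimates.values())
    return {pair: value / total for pair, value in estimates.items()}


def directional_probability(mig_ab, mig_ba):
    """Share of migration going from a to b (assumed definition).

    The reverse direction is the complement, 1 - p.
    """
    return mig_ab / (mig_ab + mig_ba)


# Hypothetical raw estimates for one GPSC and one deme pair.
raw = {"sa-mal": 0.2, "mal-sa": 0.1}
relative = relative_parameters(raw)
p_forward = directional_probability(raw["sa-mal"], raw["mal-sa"])
p_reverse = 1.0 - p_forward
```

Under these assumed definitions, the two directions of a pair always sum to one, which matches the complementary values described in the caption.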

Figure S2 Recapturing input migration parameters with a 2 deme model. A) The overlapping posterior densities of the estimates for migration parameter 1, from a population [a − d] to a population [a − d], and B) the inverse. The 'true' input parameters were mig a−d = 0.1 and mig a−d = 0.6. The posterior densities were estimated with a uniform prior and are visualized independently for each parameter (light blue); the true input parameter is indicated by the red vertical line, while the median posterior estimate is indicated by the blue dashed line. Deme A = South Africa, initial population size 6000; Deme B = Malawi, initial population size 2000; Deme C = Kenya, initial population size 500; and Deme D = The Gambia, initial population size 1000. *No-U-Turn (NUTS) sampling rather than Metropolis sampling was used for mig bc and mig cb because of difficulty converging.

Figure S3 The response of the fixation index to varied migration parameters. Each plot indicates which migration parameter was varied, and the Fst between the countries for each value of that migration parameter is indicated by the colored lines.

Figure S4 Recapturing migration parameters in the 4 deme model. The true migration parameter is indicated by the red vertical line, while the estimated median parameter is indicated by the blue dashed vertical line. The posterior distribution density is represented by the blue histograms for each deme pair indicated by the title, where a = South Africa, b = Malawi, c = Kenya, and d = The Gambia. The input population sizes for each of these scale with the true population size and are indicated in the figure.

Figure S6
Figure S5 Pairwise distance estimates for between-country genomes across all 12,582 genome pairs from South Africa, Malawi, Kenya, and The Gambia, clustered in that order by A) Hamming distance and B) Jaccard distance.
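
The two distances named in this figure can be sketched for a pair of genomes encoded as 0/1 vectors over biallelic SNP sites (1 = alternate allele present). This encoding and the function names are assumptions for illustration. The figure's own Jaccard distance may have been computed on a different representation.

```python
def hamming_distance(genome_a, genome_b):
    """Number of sites at which two equal-length genotype vectors differ."""
    if len(genome_a) != len(genome_b):
        raise ValueError("genomes must cover the same sites")
    return sum(a != b for a, b in zip(genome_a, genome_b))


def jaccard_distance(genome_a, genome_b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of sites carrying the alternate allele."""
    sites_a = {i for i, allele in enumerate(genome_a) if allele}
    sites_b = {i for i, allele in enumerate(genome_b) if allele}
    union = sites_a | sites_b
    if not union:
        return 0.0  # neither genome carries any alternate allele
    return 1.0 - len(sites_a & sites_b) / len(union)


genome_1 = [0, 1, 1, 0]
genome_2 = [0, 1, 0, 1]
d_hamming = hamming_distance(genome_1, genome_2)
d_jaccard = jaccard_distance(genome_1, genome_2)
```

Hamming distance counts every mismatching site, whereas Jaccard distance ignores sites where both genomes carry the reference allele, so the two measures can rank genome pairs differently.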

Figure S7 Pairwise Hamming distances across all genomes from each of the four demes (organized in the order of South Africa, Malawi, The Gambia, Kenya) for each GPSC in turn. These only include biallelic SNP sites. A) Includes 81 'neutral' genes. B) Includes 355 'neutral' genes.

Figure S8 Pairwise Jaccard distances across all genomes from each of the four demes (organized in the order of South Africa, Malawi, The Gambia, Kenya) for each GPSC in turn. A) Includes 81 'neutral' genes. B) Includes 355 'neutral' genes. These only include biallelic SNP sites.

Figure S9 Mutual information scores between SNP pairs across 81 'neutral' gene alignments for each of the GPSCs. The vertical dashed line indicates the 1 kb cutoff under which correlated sites were removed. The horizontal dashed line indicates the 0.2 mutual information score cutoff, which has been used previously for the pneumococcus.
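
The mutual information score between a pair of SNP sites can be sketched as follows, using empirical allele frequencies across genomes. This is a minimal illustration and uses natural logarithms. Whether the 0.2 cutoff in the figure refers to nats, bits, or a normalized score is not stated here, so the units are an assumption.

```python
import math
from collections import Counter


def mutual_information(site_x, site_y):
    """Empirical mutual information (in nats) between two aligned SNP columns."""
    n = len(site_x)
    joint = Counter(zip(site_x, site_y))
    count_x = Counter(site_x)
    count_y = Counter(site_y)
    score = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Compare the joint frequency with the product of the marginals.
        score += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return score


# Perfectly correlated balanced sites share log(2) nats of information.
mi_linked = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# Independent sites share none.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Site pairs whose score exceeds the chosen cutoff and whose positions lie within the distance window would then be candidates for removal.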

Figure S10 Estimated migration parameters after removing correlated sites, with mig ab on the left and mig ba on the right. Estimates exclude all sites within a 1 kb upstream window with r² > 0.5 (blue) or r² > 0.05 (green), or retain all sites (black). The error bars indicate 95% CIs and the x-axis indicates the GPSCs. Initial population sizes were for South Africa (deme A) and Malawi (deme B).
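
The site-removal step in this figure can be sketched as a greedy filter. The sketch computes the standard r² between two biallelic sites from haploid 0/1 genotypes and drops any site that is correlated above a threshold with an already retained site located up to 1 kb upstream. The function names and the greedy order are assumptions, and the authors' exact procedure may differ.

```python
def r_squared(site_x, site_y):
    """Squared correlation between two biallelic sites coded 0/1 across genomes."""
    n = len(site_x)
    p_a = sum(site_x) / n
    p_b = sum(site_y) / n
    p_ab = sum(1 for x, y in zip(site_x, site_y) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denominator if denominator > 0 else float("nan")


def prune_sites(positions, genotypes, window=1000, threshold=0.5):
    """Return indices of retained sites; positions must be sorted ascending."""
    kept = []
    for i, pos in enumerate(positions):
        linked = any(
            0 < pos - positions[j] <= window
            and r_squared(genotypes[j], genotypes[i]) > threshold
            for j in kept
        )
        if not linked:
            kept.append(i)
    return kept


positions = [100, 500, 5000]
genotypes = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]]
retained = prune_sites(positions, genotypes)
```

In this toy input the second site is dropped because it is perfectly correlated with the first and lies within the window, while the third is kept because it is more than 1 kb away.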

Figure S11 Pairwise comparison between the Hudson and Weir-Cockerham Fst values across all four demes. In total this includes six comparisons, one between each deme and every other deme.

Figure S13 Convergence of asymmetric 2 deme parameter models. A) The effective sample size (ESS) across all parameters estimated. ESS < 100 is indicated in red. B) The posterior density of parameter estimates between South Africa and Kenya for GPSC10. These were unable to converge because of the high collinearity between them.

Figure S15 Relative migration parameters estimated asymmetrically between two-deme pairs.

Figure S16 Posterior distributions for 6 parameter estimates for each GPSC, colored by parameter.

Figure S17 Summary of migration parameters for each GPSC. A) The directional probability from the 2 deme model for each GPSC, where red = >0.6, blue = 0.4-0.6, and grey = 0.1-0.4 probability of asymmetric migration for each deme pair. The node colors are described in the legend. B) The weighted migration from the 4 deme model between all 4 demes. The node colors are the same as in A. C) The relative migration probability for each GPSC across all demes. The origin country is colored as in A and B, and the destination country is indicated in the legend.

Figure S18 The population sizes and distance between countries versus migration parameter estimates. All plots show the migration parameter estimates (y-axis) against either the distance between countries or the population size of the countries (x-axis). The left plot shows the distance between the demes of each migration parameter (grey triangles) and the right plot shows the population size of the origin (blue) or the destination (green). The models associated with each plot are shown as lines of the same color.

Table S2 Migration parameter estimates symmetrically across four demes for six GPSCs. Values within the square brackets denote the 95% credible intervals. The 'Parameter' is the raw migration parameter estimate, while the 'Relative Parameter' is relative to all other deme pairs within each GPSC.