Uncovering the Genomic Basis of Infection Through Co-genomic Sequencing of Hosts and Parasites

Abstract Understanding the genomic basis of infectious disease is a fundamental objective in co-evolutionary theory with relevance to healthcare, agriculture, and epidemiology. Models of host-parasite co-evolution often assume that infection requires specific combinations of host and parasite genotypes. Co-evolving host and parasite loci are, therefore, expected to show associations that reflect an underlying infection/resistance allele matrix, yet little evidence for such genome-to-genome interactions has been observed among natural populations. We conducted a study to search for this genomic signature across 258 linked host (Daphnia magna) and parasite (Pasteuria ramosa) genomes. Our results show a clear signal of genomic association between multiple epistatically interacting loci in the host genome, and a family of genes encoding for collagen-like protein in the parasite genome. These findings are supported by laboratory-based infection trials, which show strong correspondence between phenotype and genotype at the identified loci. Our study provides clear genomic evidence of antagonistic co-evolution among wild populations.


S1. (A)
The percentage of sequencer reads mapping to either the D. magna host or P. ramosa parasite genome. Each sample corresponds to a single D. magna naturally infected by P. ramosa. Unmapped reads likely represent gut content or microbiota. (B) Scatterplot of coverage across host and parasite genomes. Each point represents a single sample. The y-axis shows the proportion of the reference genome covered by sequencer reads from a single sample, with the red line indicating a 95 % cutoff for inclusion in the final data set. A total of 10 samples were excluded from further analysis based on this criterion. 8 of these samples failed to reach the coverage threshold due to low recovery of pasteuria DNA, while the remaining 2 samples were excluded due to technical issues with library preparation.

S2.
Admixture derived clusters of P. ramosa samples, excluding clear cases of multiple infection. Multiple infections were defined based on elevated within-sample polymorphism (see main text). A) Model cross-validation shows that the majority of variation can be explained by three groups. B) PCA ordination of the three admixture groups, which corresponds exactly to the Alpha, Beta, and Gamma lineages identified through DAPC clustering. C) Admixture scenarios with greater than 3 groups support gene flow within the three DAPC lineages, but not across them. Here K = 7 is shown for example. Note that PC 1 and PC 3 are shown in both sub-panels in order to visualize Alpha substructure.

S3.
Plot of linkage disequilibrium dissolution with distance across the P. ramosa genome. Pairwise LD was calculated for all variants across the genome and then mean values were calculated for 100 bp bins. Results are shown for the C1 reference genome and an additional reference genome generated from the distantly related P. ramosa isolate P21. LD decay is strongly observed across relatively short distance, which is an indicator of ongoing genetic recombination. Note that LD values do not reach zero in this dataset as some of the samples in this dataset are likely clonemates.

S4.
Pasteuria Collagen-Like (PCL) gene positions in the genome assembly of P. ramosa strain C1. Complete PCL triplets comprised by a single member from each of the main PCL subfamilies are highlighted in blue. Alternate shades separate adjacent triplets on the table. Due to the circular nature of the P. ramosa genome, PCLs located at the bottom of the S5. Detailed view of Manhattan plot peaks for co-GWAS model 1. Panel A shows a broad peak that spans > 100 Kb with a clear dissolution of LD with distance. This is the expected signature of a true biological peak when performing co-GWAS with un-thinned whole genome data. The coordinates of the ABC locus region are shown in blue. The wellcharacterized non-homologous region of the ABC locus is shown by a gap in the middle of the locus where few reads are mapped. Haplotypes at this position are so divergent that mapping produces an apparent gap in the middle of the locus, with no statistics being computable. Panel B shows a contrasting, and far less convincing pattern, with the peak represented by one apparently intragenic SNP absent of linked variation. This peak likely represents a mapping artifact or stochastic noise in the dataset.

S6
. Co-GWAS model 3 results filtered to show the top 100 associations for contigs 11 and 18 of the D. magna genome, which respectively contain the C and D loci for pasteuria resistance. Although these loci exhibit similar pattern of associations as shown in models 1 and 2 (Figure 2 main text), the signal is relatively weak and difficult to discern in model 3 compared to the strong peak of association arising from chromosome 5.

S7.
Genotype clusters from DAPC analysis of all variant positions located within the ABC locus. Resistotypes (2 nd and 3 rd columns) are shown relative to the C1 and C19 P. ramosa isolates. Only samples with RR and SS resistotypes (94 % of total) are shown. The minimum number of groups required to segregate resistotypes was 3, perhaps reflecting the two homozygote and the heterozygote genotypes at these loci. S10. DAPC genomic clusters based on the sequence data at each known resistance locus region in the D. magna genome. Only samples with diagnostic resistance phenotype data are shown. Up to 10 genotype clusters could be discerned at each resistance locus, likely representing several haplotypes in various combinations with each other. In all cases, we found that the maximum segregation of resistotypes could be achieved with three clusters, which could be further collapsed into two functional genotype groups representing primarily resistant or primarily susceptible phenotypes. Imperfect correspondences between genotype and phenotype at the D and E loci do not appear to be an artifact of our clustering methodology, and may reflect the influence of yet unknown interacting resistance loci. S11. PCA loading score density across the D. magna genome relative to principal component 2. The vertical lines represent the location of the C, E, and D loci from left to right. The dashed line shows the density function adjusted to correct for the distribution of variants across the genome. PC 2 is strongly driven by local variation at the position of the E locus. S12. Co-GWAS "Megalopolis" style plots for host and parasite genomes. The co-GWAS framework can be considered as a series of single GWAS models that treat the presence of each variant in the paired genome as a distinct "phenotype". These results can be visualized as a series of stacked Manhattan plots which can be viewed simultaneously as if one were looking down through a very large stack of transparent pages (10 4 -10 5 layers for each panel). This approach can be instructive as an exploratory data analysis tool for co-GWAS, but can be misleading when interpreted as a typical Manhattan plot; hence our preference for the slightly different nomenclature. Here we see the correspondence between the raw co-GWAS data visualized in this Megalopolis plot format, and the circular plot format shown in the main text which more explicitly shows the strongest points of interlinkage between the genomes. The colored lines represent the location of the C locus (blue), E locus (green), the newly detected locus (purple) and the D locus (red). PCL gene triplets are shown as vertical lines on the P. ramosa plots.