Genomic Evolution of the Pathogenic Wolbachia Strain, wMelPop

Most strains of the widespread endosymbiotic bacterium Wolbachia pipientis are benign or behave as reproductive parasites. The pathogenic strain wMelPop is a striking exception, however: it overreplicates in its insect hosts and causes severe life shortening. The mechanism of this pathogenesis is currently unknown. We have sequenced the genomes of three variants of wMelPop and of the closely related nonpathogenic strain wMelCS. We show that the genomes of wMelCS and wMelPop appear to be identical in the nonrepeat regions of the genome and differ detectably only by the triplication of a 19-kb region that is unlikely to be associated with life shortening, demonstrating that dramatic differences in the host phenotype caused by this endosymbiont may be the result of only minor genetic changes. We also compare the genomes of the original wMelPop strain from Drosophila melanogaster and two sequential derivatives, wMelPop-CLA and wMelPop-PGYP. To develop wMelPop as a novel biocontrol agent, it was first transinfected into and passaged in mosquito cell lines for approximately 3.5 years, generating wMelPop-CLA. This cell line-passaged strain was then transinfected into Aedes aegypti mosquitoes, creating wMelPop-PGYP, which was sequenced after 4 years in the insect host. We observe a rapid burst of genomic changes during cell line passaging, but no further mutations were detected after transinfection into mosquitoes, indicating either that host preadaptation had occurred in cell lines, that cell lines are a more selectively permissive environment than animal hosts, or both. Our results provide valuable data on the rates of genomic and phenotypic change in Wolbachia associated with host shifts over short time scales.


genome?
It is clear from the coverage graphs shown in Figure 3 in the main text that an approximately 20 Kb region of the genome is triplicated in wMelPop. Here we have attempted to estimate the boundaries of the triplicated unit more precisely.
We mapped all wMelPop 454 reads to the wMel reference genome, and used Perl scripts to extract coverage information from the Newbler output file 454AlignmentInfo.tsv. We then examined two measures of read mapping coverage around the triplicated area of the wMelPop genome (Supplementary Figure S1). Unique coverage at a site is the number of non-duplicate, uniquely mapping reads that align at that site. Total coverage is calculated as the sum of unique reads mapped to that site plus the estimated number of repeat reads mapped to the site. The repeat estimate is made by assigning each repeat read to a randomly chosen instance of the repeat in the genome.
The genomic region extending from the start of WD0507 to the end of the intergenic region following WD0514 has coverage approximately three times greater than average for the rest of the genome ( Figure S1, top panel). It is not clear, however, whether the repeat genes flanking this region -WD0506 at the 5' end and WD0515 and WD0518 at the 3' end -are also part of the triplicated unit. Their estimated coverage is intermediate between that of the triplicated region and the rest of the genome ( Figure S1, second panel), but as any increase in reads matching these repeats due to triplication will be randomly distributed across all repeat copies in the reference genome, an increase in copy ! 3! Figure S1: Coverage depths of wMelPop 454 reads mapped to the wMel genome between coordinates 480,000 and 520,000. Two plots are shown: unique coverage (top panel) and total coverage (second panel). The third panel shows wMel genes surrounding the boundaries of the triplicated region in green; the copy of the IS5 element encoded by WD0516 and WD0517 in wMel is absent from the wMelPop genome so these genes are shown in grey.
! 4! number would be difficult to detect, given the inherent random variation in coverage along the genome. We therefore took a different approach to test whether these flanking repeat genes are part of the triplicated unit, by attempting to estimate the distance separating the known portions of the units.
In the schematic below, each green block represents the unit that is known to be triplicated, which extends from the start of WD0507 (marked Ai) to the end of the intergenic region following WD0514 (marked Bi). The flanking repeat regions are represented as blocks of light orange (WD0506) and dark orange (WD0515 and WD0518). If only the region A to B is triplicated, then at each of the junctions between the triplicated units the gap between the B at the end of one unit (e.g. B1) and the A at the start of the subsequent unit (e.g. A2) will be 0 nt in length ( Figure S2).

Figure S2
On the other hand, if both sets of flanking repeat sequences are also part of the triplicated unit, then the gap between B1 and A2 may be as long as 2700 nt ( Figure S3).

Figure S3
To try and estimate the length of this gap, and thereby infer whether some, all or none of the flanking repeat sequences are included in the triplicated unit, we identified read pairs that spanned the junction between units. We used Perl scripts to scan the Newbler output file 454PairAlign.txt and identify paired-end reads which matched four criteria: (1) each read in the pair could be mapped to a unique location in the wMel reference genome, (2) the mapping location of one read was within 4000 nt of start of the known triplicated unit (i.e.
between wMel coordinates 488,397 and 492,397), (3) the mapping location of the other read was within 4000 nt of the end of the known triplicated unit (i.e. between wMel coordinates 503,580 and 507,580), and (4) the direction of mapping of the reads confirmed that the reads mapped across a junction of the tandem repeat unit, rather than within a single unit. We identified 46 read pairs that met these criteria.
The average distance between properly mapped read pairs for the complete wMelPop dataset is 1923 nt, with a range of 954 to 3872 nt (from Newbler output file 454NewblerMetrics.txt). We used this information, and the mapping locations of the 46 repeat pairs described above, to estimate the length of the gap between B1 and A2 (and between B2 and A3, assuming the two junctions are identical) as follows.
The distance, pair_distance i ( Figure S4), between the mapping locations of a pair of reads, read_lefti and read_righti, will be the sum of the distance from the start of read_lefti to B1, the distance from A2 to the end of read_righti, and the gap between B1 and A2: which can be trivially rearranged to allow us to estimate gap length: gap length = pair_distance i -(B1 -read_lefti) -(read_righti -A2) As we know that pair distance varies for individual read pairs, we can obtain a more accurate estimate of the gap length by averaging across all 46 read pairs that span the gap: If we substitute the average known pair distance of 1923 nt into the equation above, we obtain an estimated gap length between B1 and A2 of 208 nt. Estimates of gap length for individual read pairs range from -1370 to 1235 nt, and the standard deviation of the mean is 636 nt.

! 7!
Negative gap length estimates indicate that, as expected, a range of fragment sizes were sequenced in the 454 library, and some were shorter than the mean size of 1923 nt.
Although the error associated with our estimate of the gap length is large, these data do allow us to exclude two possible scenarios for the boundaries of the triplicated unit. If full-length paralogs of both WD0506 and WD0515/WD0518 were part of the triplicated unit, the gap length between B1 and A2 would be approximately 2700 nt; if only one of the flanking repeats were included the gap would be a little longer than 1300 nt. Both of these hypotheses can be rejected by our data. We cannot exclude the possibility that the gap between B1 and A2 is 0 nt.
However, given that the mean estimate is 208 nt, we believe that it is most likely that a partial fragment of the flanking repeat/s is included in the triplicated unit. This is supported by the fact that we were able to identify a number of single reads matching the boundary between B and WD0515, and between WD0506 and A, but no reads matching a boundary between contiguous B1-A2 sequences, which should be present if no flanking repeat sequence were included in the triplicated unit. We therefore sought to identify the boundaries of the deletion based only on the mapping of read pairs that we could be confident were mapped to those instances of the repeat sequences flanking the deletion. We used Perl scripts to parse the Newbler output file 454PairAlign.txt and identify read pairs that matched two criteria: (1) One read of the pair had a unique mapping location in the wMel genome, and that hit was within 4000 nt of the edge of the deletion as defined by the unique coverage plot (i.e. between ! 9! coordinates 482,855 and 486,855, or between 509,878 and 513,878).
These were considered "anchor" reads.
(2) The other read of the pair mapped to one or more locations in the wMel genome, and at least one of those hits fell between the edge of the deletion as defined by unique coverage and 4000 nt further into the deletion (i.e. between 486,855 and 490,855 or between 505,878 and 509,878). These were "extension" reads.
We chose 4000 nt as the window size for both these criteria as this is approximately the maximum distance between mapped pairs in the ! 10! Figure S5. Coverage depths of wMelPop-PGYP 454 reads mapped to the wMel genome between coordinates 480,000 and 520,000. Details of plot construction are as for Figure S1, above.

Characterization of wMelPop in the Canton-S genetic background
To test whether the D. melanogaster Canton-S background has the ability to supress the life-shortening phenotype typically induced by the Quantitative PCR revealed that the density of the avirulent wMelCS strain in young adult Canton-S female flies was significantly higher than wMelPop strain. In older flies, the density of the wMelPop strain increased significantly to levels approximately 10 fold higher than the wMelCS strain. Our results indicate that Canton-S does not repress pathogenesis, and that phenotypic differences between wMelCS and wMelPop are due to Wolbachia strain rather than host effects.   The progeny resulting from transinfected G1 females was reared to adulthood and screening and selection was continued until the G6 generation. Lines were closed by allowing the males and females of the G7 generation to mate with one another.  XhoI recognition sequences, respectively (underlined).

Methods
ˆ P6 primer contains the 10bp sequence deleted in WD0413 from wMelPop-CLA (underlined), therefore this primer will only amplify from DNA that does not has the deletion.