Ancient hybridization and introgression of an invadolysin gene in schistosome parasites

The parasitic blood fluke Schistosoma haematobium causes urogenital schistosomiasis in humans and is a major cause of morbidity and mortality across sub-Saharan Africa. S. haematobium hybridizes with livestock schistosomes, including S. bovis, however the frequency, direction, age and genomic consequences of hybridization are unknown. We sequenced 96 S. haematobium exomes from Niger and the Zanzibar archipelago. and found evidence of an ancient, introgression event between Nigerien S. haematobium and S. bovis occurring 108-613 generations ago. Between 3.3-8.2% of Nigerien S. haematobium genomes are derived from S. bovis alleles, some of which show signatures of directional selection; the strongest signal spans a single gene in the invadolysin gene family, an M8 metalloprotease associated with parasitic life-history traits.

Introgressive hybridization occurs when hybrid offspring repeatedly backcross with one or both parental types, acting as a conduit for genetic exchange between species. Once present in the "new" genetic background, introgressed loci are broken up through recombination and exposed to selection. Alleles that are deleterious in the new genetic background are purged while advantageous alleles may increase in frequency if they escape loss by drift. Examples of introgression are increasingly common 4 . Hybridization among parasite species is a concern because genes encoding biomedically important traits such as virulence, host specificity or drug resistance may be transferred between species 5 . Schistosomes are dioecious trematodes that cause the chronic and debilitating disease schistosomiasis in humans and animals. Inter-specific hybridization, and the production of viable hybrid offspring, have been demonstrated experimentally between members of the S. haematobium species group 6 . Adult worms live in the blood vessels of their human hosts and cannot be easily sampled directly. However, there is genetic evidence --discordance between nuclear (ITS rDNA) and mitochondrial (cox1) DNA--for natural hybridization of S. haematobium with closely related sister species in Africa 7-11 from genotyping larval parasites. One such hybridization of major concern is between S. haematobium and the livestock schistosome, S. bovis commonly observed in West Africa.
To investigate this natural hybridization at the genomic level, we collected S. haematobium eggs, from human urine samples, and hatched miracidia. We sequenced exomes from 96 miracidia (1 per patient) from Niger (n=48) and Zanzibar (n=48) using whole genome amplification and exome capture (Data S1; Fig S1; 12) . Initial attempts to genotype mitochondrial loci contained within the exome probe set failed in 31 of the Nigerien S. haematobium samples. We therefore used de novo assembly of the mitochondrial genome and cox1 Sanger sequencing to genotype the mitochondrial DNA and found that these samples contained a S. bovis mitochondrial haplotype that was 15-18.05% divergent from that of S. haematobium and therefore poorly captured by our exome capture baits (fig S2; 13) . In all, 65% (n=31) of the Nigerien S. haematobium miracidia examined had a S. bovis mitotype, confirming S. bovis mitochondrial introgression into S. haematobium as previously identified in other West African S. haematobium populations 9,11 . In contrast, the S. bovis mitotype was not present in the Zanzibari S. haematobium population.
The presence of S. bovis mitochondria in Nigerien S. haematobium could result from contemporary hybridization or past introgression events. To differentiate between these two scenarios, we examined 370,770 autosomal, exonic (single nucleotide polymorphisms) SNPs 12,14 recovered by the exome capture probes. In addition to our 96 S. haematobium samples we used sequence data from six closely-related schistosome species in the S. haematobium species group (S. bovis, S. curassoni, S. intercalatum, S. guineensis, S. mattheei, and S. margrebowiei) generated in previous studies 15,16 . The genomes of S. haematobium and S. mansoni are ≥89.4% syntenic 15 .
To take advantage of the more contiguous, chromosomal length assembly and gene annotations available for S. mansoni 17,18 , we aligned the S. haematobium and S. mansoni assemblies to convert the coordinates from their position on S. haematobium contigs to the corresponding position on S. haematobium and other schistosome species suggests we did not have early generation hybrids within our samples. We used ADMIXTURE 19 to assign ancestry proportions to the S. haematobium, bovis, and curassoni samples (table S1). S. curassoni was included due to its close phylogenetic affiliation with S. bovis. Three distinct ancestry components were identified corresponding to S. bovis/curassoni, Nigerien, and Zanzibari S. haematobium. The S. bovis/curassoni ancestry component was present in 16 of the 48 Nigerien S. haematobium miracidia indicating potential low levels of admixture (0.1 < Q > 2.7; Fig 1B), while no admixture was identified in the Zanzibari S. haematobium. The three-population test (f3; 20) indicated admixture in the Nigerien S. haematobium population and identified S. bovis as the most likely source ( Fig 1C). Finally, we confirmed that S. bovis alleles in the Nigerien S. haematobium Smp_127030 is expressed in adult worms in S. haematobium, but expression data is not available for other life stages (except eggs). It is unclear how the S. bovis invadolysin (Smp_127030) allele benefits S. haematobium. We speculate that the introgressed invadolysin is involved in tissue penetration or immune evasion within the mammalian host.
Hybridization between species is a powerful source of evolutionary novelty. Introgression among parasites is a concern 5 , because genes conferring biomedically important traits may be transferred across species boundaries; this has clearly taken place in S. haematobium. While we see no evidence of early generation hybrids, we see a signature of introgression from S. bovis in all Nigerien miracidia examined. We demonstrate that an introgressed M8 metalloprotease (invadolysin; Smp_127030) expressed in adult worms has spread to high frequencies in S. haematobium populations in Niger, and that parasites bearing introgressed S. bovis-derived alleles have a strong selective advantage over those carrying native S. haematobium alleles. Zanzibar, all aspects of sample collections were carried out in the framework of the disease control activities implemented and approved by the local Ministry of Health (MoH) and adopted by regional and local administrative and health authorities.

Materials and Methods
The study participants were informed about the study objectives and procedures. Written consent was obtained from parents prior to sample collection from children. Participation was voluntary and children could withdraw or be withdrawn from the study at any time without obligation. All children were offered PZQ (40 mg/kg single oral dose) treatment in the frame of the following school-based or community-wide treatment carried out by the MoH. haematobium eggs were harvested from each infected urine sample by sedimentation or filtration, put into fresh water and then exposed to light to facilitate miracidial hatching. Miracidia were captured individually, under a binocular microscope, in 3 µL of water and spotted onto indicating Whatman FTA Classic Indicating cards (GE Healthcare Life Sciences, UK) using a 20 µL micropipette, dried and archived in SCAN 36 .
Library prep and sequencing: We used whole genome amplification of single miracidia dried on FTA cards followed by exome capture and sequencing (Illumina 2500) following methods described in 12 . In addition to exome capture, whole-genome sequence data was generated for twelve samples, six from each population (Niger and Zanzibar). Libraries were multiplexed and paired end 150 bp reads were sequenced on a single Illumina NextSeq flowcell. In addition to the whole genome and exome sequence data generated above, we gathered available genome sequence Variant discovery, filtration, and phasing: Sequence reads were trimmed with Trimmomatic v0.36 37 so that the leading and trailing base calls had phred-scaled quality scores greater than 10 and the phred score was greater than 15 over a 4 nt sliding window. After trimming, paired and singleton reads were mapped to the reference S. haematobium genome (SchHae_1.0, GCA_000699445.1, lab strain, originally from Egypt). BWA v0.7.17-r1188 38 and merged into a single BAM file using SAMtools v1.7 39 . GATK v4.0.1.1 40 was used to add read group information and mark duplicate reads from each library. Where possible, complete read group information was added based on information contained within the FASTQ header. In some cases (primarily the public data) pseudo read groups were created to differentiate each library. GATK's Haplotypecaller 40 was used for variant discovery. Variant discovery was restricted to target regions of the S. haematobium assembly using the -L option. Target regions were identified by combining all adjacent, exome probe locations within 500 bases of one another into larger intervals with BEDtools v2.27.1 41 . Each interval was genotyped using GATK's GenotypeGVCFs and combined into a single VCF for filtering using GATK's MergeVcfs. Low quality SNP genotypes were filtered using GATK's VariantFiltration with the following filters: variant quality normalized by depth (QD < 2.0), strand bias (FS > 60.0), mapping quality (MQ < 40.0), mapping quality rank sum (MQRankSum < -12.5), and read position rank sum (ReadPosRankSum < -8.0). Sites with high rates of missing data (>20%), multi-allelic sites, and individuals with low call rates (>15% missing data) were removed using VCFtools v 0.1.15 42 . All indels were excluded from downstream analyses.
The published S. haematobium and S. mansoni assemblies vary greatly in quality, despite a high degree of synteny between the two genomes 15 . We aligned the two genomes using progressiveCactus v0.0 43 using default parameters to leverage the contiguity of the S. mansoni assembly. The HAL alignment file was used to lift SNP coordinates between the assemblies using progressiveCactus' halLiftover module. Multi-position SNPs, those that align between assemblies in something other than a 1:1 relationship, were removed from downstream analyses. We used linkage disequilibrium (LD) decay curves to examine biases introduced during the liftover by comparing LD from SNPs associated with each assembly. SNPs mapping to the S. mansoni autosomes (chr1-7) were extracted using VCFtools v0.1.15. The square of the correlation coefficient (r 2 ) was calculated for all SNPs within 1.5 Mb of each other for each dataset using PLINK v1.90b4 44  Mitotype assignment -Unique filters were used to manage the mitochondrial SNPs. First, the mitochondrial contig was identified from the S. haematobium assembly using the haematobium vs. mansoni whole genome alignment and confirmed using NCBI's BLAST server. Due to low genotyping rates within the Nigerien samples, mitochondrial SNPs were filtered so that the genotyping rate per site was reduced from 85% to 25%. Putative mitochondrial haplotypes (mitotypes) were generated by removing any heterozygous sites, if present. Given previously described rates of mitochondrial introgression and divergence between S. bovis, S. curassoni and S. haematobium, we mapped filtered reads directly to the S. haematobium mitochondrial contig (AMPZ01026399.1) and a previously generated S. curassoni mitochondria sequence (AP017708.1). Full-length S. bovis mitochondrial genomes are not currently available through annealing phase, and a 1 min 72˚C extension, followed by a final 7 min, 72˚C extension cycle.
Amplification and fragment size were confirmed on a 1.5% agarose gel. Bidirectional sequences were generated for each fragment with the Cox1_schist_5' and Cox1_schist_3' primers by the Eurofins (Eurofins-MWG) sequencing service. All cox1 sequences were compared to available sequences on GenBank for species identification.
Population structure, admixture, and ancestry assignment -Summary statistics were calculated using all filtered autosomal SNPs with minor allele frequency greater than 5%. FIS was calculated for each sample using VCFtools and the "--het" option. The scikit-allel v1.1.10 47 Python library was used to calculate FST between the Zanzibari and Nigerien populations of S. haematobium.
Genome-wide values for FST were averaged from blocks of 100 variants and local calculations were generated from sliding windows of 250 kb and 50 kb steps. Relationships between samples were visualized in genotypic space with a PCA of unlinked SNPs. Linked sites within sliding windows of 25 SNPs and a pairwise R 2 value greater than 0.2 were filtered using PLINK v1.90b4 44 .
Since comparison between species can overwhelm population level clustering, separate PCAs were generated from previously published Schistosoma species 15,16 , and the exome data from the S. haematobium miracidia (Niger and Zanzibar) presented herein. Nucleotide diversity (π) was calculated per site and in sliding windows of 50 Kb with 25 Kb steps using VCFtools. Levels of admixture were calculated using ADMIXTURE v1.3.0 19 . The number of populations (K) was explored from 1 to 20 and cross-validation between 1,000 replicates was used to identify the most robust measure of K. The 3-population test f3 as implemented in the scikitallel was used to identify admixture in the Nigerien and Zanzibar S. haematobium populations 20 .
f3 standard errors and Z-scores were calculated from block-jackknife replicates of 100 SNPs. We generations. Prior odds for the neutral models (α = 0) were set at 10:1. To minimize autocorrelation, samples were thinned by 20 and 50,000 samples were taken. Convergence between chains was examined in R using the coda package v0.19-1. A Gelman-Rubin convergence diagnostic score less than 1.1 was defined as an acceptable level of convergence a priori. In addition to BayeScan, we identified regions under selection in both the S. haematobium populations using cross population extended haplotype homozygosity (xpEHH; 27) score as implemented in the rehh v2.0.2 R package 48 . When possible, alleles were polarized against an outgroup (S. margrebowiei) to define the ancestral state. If sites were heterozygous or missing in S. margrebowiei they were excluded. Regions under selection were defined using a -log10(p-value) = 3 an an |xpEHH| > 3. Adjacent loci ≤12.5 kb apart were combined into a single locus.
Briefly, derived (autapomorphic) S. bovis and Zanzibari S. haematobium alleles were identified and used to annotate Nigerien S. haematobium alleles. Introgression blocks were identified by finding the largest stretch of S. bovis or pleisiomorphic alleles that were unbroken by a Zanzibari S. haematobium allele. With these tracts we estimated the number of generations since admixture 23 : Where G is generations, L is the average length of introgression tracts in Morgans and P is the proportion of the genome from the major parent.
We dated time since divergence of the Chr4:20,023,951-20,047,325 locus in the Nigerien S.
haematobium populations using startmrca 28 . We used a uniform recombination rate of 3.4 x 10 -8 and mutation rate of 8.1 x 10 -9 . We centered the analysis on Chr4:20,033,013 and included 1Mb of upstream and downstream sequence. MCMC chains were run for 50,000 generations and limiting proposals to 20 standard deviations. Ten independent chains were run using the parameters described above. Estimates of divergence time were taken by discarding the first 40,000 generations of each run then thinning the remainder to 1 sample per 10 generations.  58 . Gene coordinates were lifted from the S. mansoni assembly to identify which of the Stringtie predicted transcripts was invadolysin (Smp_127030) using progressiveCactus' halLiftover and the whole genome alignment generated (described above).

M8 peptidase and
Given the limited gene expression data available for S. haematobium, we examined expression of Smp_127030 and other invadolysin paralogs in S. mansoni 17 . We used RNA-seq data from miracidia 59 , cercariae 18 , schistosomula (3h and 24h old; in vitro transformation 18) , sporocysts (48h old; in vitro transformation; 59) , juveniles (single sex; 18, 21, 28 day old; 60) , and adults (38 day old; separate males and females from mix infections; 60) . We aligned the data using STAR v2.5.4b 61 16) which was converted to the GTF format using gffread from cufflinks v2.2.1 62 and either an overhang (-sjdbOverhang) of 75 (used for schistosomula and cercariae data) or 99 (used for all other libraries). We used RSEM 63 to compute transcript per million (TPM) counts for each isoform.

Data Availability
Sequence data are deposited in NCBI's under BioProject accessions PRJNA508633 and PRJNA443709. Mitochondrial genomes are available in are deposited in Genbank (MK253567-MK253578).