Genomic Analysis of Plasmodium vivax in Southern Ethiopia Reveals Selective Pressures in Multiple Parasite Mechanisms

Abstract The Horn of Africa harbors the largest reservoir of Plasmodium vivax in the continent. Most of sub-Saharan Africa has remained relatively vivax-free due to a high prevalence of the human Duffy-negative trait, but the emergence of strains able to invade Duffy-negative reticulocytes poses a major public health threat. We undertook the first population genomic investigation of P. vivax from the region, comparing the genomes of 24 Ethiopian isolates against data from Southeast Asia to identify important local adaptions. The prevalence of the Duffy binding protein amplification in Ethiopia was 79%, potentially reflecting adaptation to Duffy negativity. There was also evidence of selection in a region upstream of the chloroquine resistance transporter, a putative chloroquine-resistance determinant. Strong signals of selection were observed in genes involved in immune evasion and regulation of gene expression, highlighting the need for a multifaceted intervention approach to combat P. vivax in the region.

The Ethiopian samples were sourced within the framework of previously described surveys recruiting symptomatic patients with P. vivax-positive blood smears [10]. Samples were collected between May and November 2013 from 4 sites in southern Ethiopia; Arba Minch, Misrak Badowacho, Halaba, and Hawasa (Supplementary Figure 1). Details on the local malaria epidemiology are published elsewhere [10]. In brief, Plasmodium falciparum and P. vivax are the dominant malaria-causing species, with vivax proportions ranging from 28% to 55%. The annual P. vivax parasite incidence (API) in 2012 ranged from 20-82 cases per 1000 population across the sites. The national policy for treating uncomplicated P. vivax infection is chloroquine plus primaquine, whereas artemether-lumefantrine is used to treat mixed-species infections of P. falciparum and P. vivax.
For comparative analysis, previously published genomic data were derived from Thailand, Indonesia, and Malaysia [5,13]. These samples were sourced from symptomatic patients with P. vivax infection attending outpatient clinics in Tak Province, Thailand (2006Thailand ( -2013; Mimika district, Papua Indonesia (2011-2014); and Sabah, Malaysia (2010)(2011)(2012)(2013)(2014)(2015). Estimates of the P. vivax incidence (API per 1000 population) at these times were 15.5 in Tak Province, 69. 2-148.7 in Mimika district, and 0.02-0.32 in Sabah (data from Malaria Atlas Project and Sabah Department of Health). The first-line policy for treating P. vivax infection at the time of the enrollments was chloroquine plus primaquine in Thailand and Malaysia, and dihydroartemisinin-piperaquine plus primaquine in Indonesia.

Sample Processing
Patient sampling involved collection of 2-5 mL venous blood, with leukocyte depleted by cellulose filtration [18]. DNA extraction was performed using commercial kits (Qiagen), and Plasmodium species was confirmed by polymerase chain reaction [19,20].

Whole Genome Sequencing
Leukocyte-depleted samples with ≥50 ng total DNA comprising <90% human DNA were subject to whole genome sequencing, read alignment, and variant calling within the framework of a P. vivax community study in the Malaria Genomic Epidemiology Network (MalariaGEN) [21]. Details on the library preparation, sequencing, and variant calling procedures are provided elsewhere [5]. In brief, sequencing was undertaken on the Illumina Hi-Seq platform using the manufacturer's protocol for pairedend 150-bp reads. Reads were aligned against the P. vivax P01 reference [22] using bwa-mem version 0.7.15 [23], and single-nucleotide polymorphism (SNP) discovery and genotype calling were undertaken using previously defined methods [5], with data derived from the MalariaGEN P. vivax Genome Variation Project release 3.0. A set of 827 902 high-quality bi-allelic SNPs with a variant quality score log-odds >0 and <5% missing calls in high-quality samples (samples with ≥95% calls) were derived from an initial set of 4 084 419 discovered variants. Positions with <5 reads were defined as missing calls. For haplotype-based analyses, either the major or reference allele (at positions with equal allele depths) was used at heterozygote positions. Large copy number variations (CNVs) >3 kbp were detected using a hidden Markov model in pysamstats (http:// github.com/alimanfoo/pysamstats) as described previously [24]. As the accuracy of the hidden Markov model in predicting the breakpoint sequence is imperfect, closer inspection of CNV boundaries was undertaken using Artemis software. The genomic data on the previously published and the new Ethiopian isolates are available in the European Nucleotide Archive (see [5,13] and Supplementary Table 1, respectively).

Data Analysis
Within-host infection complexity was assessed using the within-sample F statistic (F WS ) [25,26]. A threshold of F WS > 0.95 was used as a proxy to monoclonal infection. Ethiopian isolates with F WS < 0.95 were deconvoluted (phased) using DEploid identity by descent (IBD) [27] to determine the within-sample relatedness. DEploid-IBD was run with the default settings using a panel of 4 distinct Ethiopian isolates with F WS > 0.95 (QS0002-C, QS0018-C, QS0028-C, and QS0042-C). Downstream haplotype-based analyses were undertaken on the samples with F WS > 0.95, and additionally explored with the deconvoluted infections. Infection complexity was also assessed using the proportion of runs of homozygosity (RoH) in the isolates from all populations using previously described methods [5].
The number of nucleotide differences between pairs of DNA sequences was calculated at the SNP positions using VCFtools version 0.1.13 [28] and divided by the number of nucleotides in the core genome (21 310 118 bp) to derive an approximation of the genome-wide average nucleotide diversity (pi) for each population. Neighbor-joining (NJ) and principal coordinates analyses (PCoA) were conducted using a pairwise distance matrix calculated using R, and NJ plots were created using iTOL software [29].
Tajima's D was calculated using the scikit-allel package (https://github.com/cggh/scikit-allel). Only genes with ≥10 SNPs across all populations were assessed. Pairwise measures of the genetic differentiation (fixation index [F ST ]) at individual SNPs were calculated using the Weir and Cockerham formula [30]. The PlasmoDB Gene Ontology (GO) analysis tool was used to investigate enrichment of biological processes and molecular functions among the genes within the 1st and 99th percentile of the Tajima's D distributions in Ethiopia and highly differentiated variants between Ethiopia and Asia (http://plasmodb.org).
The Rsb measure of cross-population extended haplotype homozygosity and the integrated haplotype score (iHS) were measured using the R-based rehh package [31]. For iHS analysis, ancestral alleles were derived by mapping Plasmodium cynomolgi reads [32] against the P. vivax P01 reference. Only positions where P. cynomolgi calls were homozygote and matched either the reference or alternative P. vivax allele were included in analysis.
Owing to extensive population structure in Malaysia, iHS, Rsb, and Tajima D were not investigated in this population, and F ST results between Ethiopia and Malaysia were explored secondary to results against Thailand and Indonesia.

High Diversity and Frequent Outcrossing in Ethiopia
The average nucleotide diversity (pi) in Ethiopia (6.47 × 10 -4 ) was lower than Thailand (7.69 × 10 -4 ) and Indonesia (6.78 × 10 -4 ) but higher than Malaysia (4.55 × 10 -4 ). In contrast to the extensive substructure observed in Malaysia with PCoA and NJ analysis, higher levels of outcrossing were apparent in Ethiopia, Thailand, and Indonesia (Figure 2A-D). Similar results were observed with the deconvoluted Ethiopian haplotypes (Supplementary Figure 3A-D).

Varying Prevalence of Drug Resistance-Associated Mutations in Ethiopia
The prevalence of mutations associated with clinical or ex vivo antimalarial drug resistance was investigated in each population (  Table 1). None of the Ethiopian isolates had MDR1 CNVs. For reference purposes, the prevalence of nonsynonymous variants in orthologues of other genes implicated in drug resistance in P. falciparum is provided in Supplementary Data 3, including chloroquine resistance transporter (CRT), kelch-13, plasmepsin IV, multidrug resistance-associated proteins 1 and 2, and multidrug resistance protein 2.

Factor in Ethiopia Using iHS Analysis
Analysis of other gene regions demonstrating evidence of recent directional selection was undertaken using the iHS score. Population-wide analyses revealed 9 regions with strong signals of selection, 2 of which demonstrated weak signals in Ethiopia ( Figure 3; Supplementary Data 4). The largest signal in Ethiopia was a 220-kb region on chromosome 14 comprising 50 genes and overlapping with a previously described signal postulated to reflect selection at an AP2 domain transcription factor (PVP01_1418100) [17]. The other Ethiopian signal reflected a 1-kb region on chromosome 5 comprising a SPRY domain protein (PVP01_0508000). . Trends in the pairwise identity by descent (IBD) between clones in a given infection (as determined by DEploid-IBD) correlated positively with the F WS scores and approximated runs of homozygosity (RoH, not presented). DEploid-IBD found evidence of 2 major clones in QS0025-C, QS0015-C, and QS0004-C, and 3 clones in QS0012-C, QS0011-C, QS0032-C, and QS0031-C. The clones within QS0025-C demonstrated the highest pairwise IBD (0.49), indicative of siblings sharing approximately 50% of their genomes, as illustrated by the long stretches of homology (regions with NRAF approximating 0 or 1) on each chromosome. In striking contrast, the clones within QS0031-C exhibited the lowest pairwise IBD (<0.01%), with no evidence of recent common ancestry, rather reflecting a probable superinfection.

Rsb-Based Signals of Differential Selection Between Ethiopia and Asia in Drug Resistance Candidates
The Rsb metric was used to identify regions with recent directional selection in one population relative to another. Sixteen regions demonstrated strong differential selection between Ethiopia and Thailand or Indonesia, including 5 drug resistance candidates (  .8%, 9%, and 2.9% of the variance, respectively. C and D, Unrooted and rooted neighbor-joining trees, respectively. The rooted tree is presented to illustrate similarity between infections in a given population rather than evolutionary patterns. The PY0120-C isolate from Malaysia, labeled with a star, was used as the ancestral sample; this sample is a suspected imported case that has been shown to have close identity with infections from India and Bangladesh (data not presented).

F ST -Based Differentiation of Nonsynonymous Variants Between Ethiopia and Asia in Genes Involved in Regulation of Gene Expression
F ST analysis was undertaken to identify recent as well as older signals of selection that might not be detected with haplotype-based methods. A total of 306 and 1162 SNPs exhibited F ST > 0.8 in Ethiopia vs Thailand and Indonesia, respectively; among these variants, 126 and 461 conferred nonsynonymous changes (Supplementary Data 5). After Bonferroni correction, there was no significant representation of any GO terms; however, the most enriched class of genes with highly differentiated nonsynonymous variants were those involved in the regulation of gene expression (11/68 genes with GO ID 0010468). Among the known drug resistance-associated determinants, high differentiation was observed between Ethiopia and Thailand in the DHPS A383G mutation.

Tajima D Analysis in Ethiopia
Tajima's D analysis was conducted to identify regions under balancing as well as directional selection in Ethiopia (Supplementary Data 6). There was no significant enrichment of any GO term in either the 1st or 99th percentile of D scores in Ethiopia after Bonferroni correction. CRT and a proximal gene on chromosome 1 involved in ion transport (SCO1, PVP01_0109200) were among the 1st percentile in Ethiopia, suggesting evidence of positive selection. However, among 43 CRT and 15 SCO1 SNPs detected across the 3 populations, only 3 and 4 SNPs, respectively, appeared to be segregating in Ethiopia. Furthermore, neither CRT nor SCO1 were among the 1st percentile of genes in Thailand or Indonesia. A range of genes with various functions was observed among the 99th percentile in Ethiopia, reflecting genes putatively under balancing selection, including AMA1.

High Prevalence of Duffy Binding Protein 1 Copy Number Amplifications in Ethiopia
Copy number amplification was observed in 3 gene regions in Ethiopia; a 28S ribosomal RNA gene (PVP01_0504500) with amplification in 100% infections, an exported Plasmodium protein (PVP01_1470400) in 50% infections, and Duffy binding protein 1 (DBP1, PVP01_0623800) in 79% infections (Table 2; Supplementary Data 7). Deletions were also observed in 2 regions encompassing exported Plasmodium proteins: PVP01_0523500 present in 63% (15/24) of Ethiopian isolates and PVP01_0524900 + PVP01_0525000 in 4% (1/24). While the deletions were only observed in Ethiopia, the amplifications were also observed in Asia. The genetic architecture of the DBP1 amplification, which is putatively involved in P. vivax invasion of Duffy-negative RBCs [35], was explored further. DBP1 copy number ranged from 2 to 5 in Ethiopia, and 2 to 3 in the Asian populations. Artemis-based inspection of the amplification revealed a single 5′ and two 3′ breakpoints reflecting the previously described Malagasy and Cambodian type breakpoints [36,37] in Ethiopia (Supplementary Figure 4). The most common breakpoint was the Cambodian type (63% [16/19]), also common in Indonesia (100% [5/5]) and Thailand (96% [30/31]). Inspection of the monoclonal infections with amplifications revealed a large range of DBP1 haplotypes in Ethiopia (minimum 8 haplotypes based on differences at nonheterozygous positions) ( Figure 5). Furthermore, 38% (5/13) of isolates exhibited haplotype differences (heterozygote positions) between the copies within a given infection. The Thai parasites formed a largely separate cluster from Ethiopia, but with similar patterns of DBP1 diversity within and between infections (minimum 12 haplotypes, 35% [7/20] infections with heterozygotes). The sequence flanking DBP1 also displayed moderate diversity in Ethiopia and Asia, with no evidence of selective sweep dynamics (Supplementary Figure 5).

DISCUSSION
This study presents the first population genomic investigation of P. vivax in Ethiopia, where the largest reservoir of this species in Africa persists. Our results reveal patterns of polyclonality, within-infection relatedness, population diversity, and structure comparable to Thailand and Indonesia, where transmission was stable at the time of sampling. We see evidence of genetic adaptations in various parasite mechanisms, which may facilitate the ongoing persistence of the population. The different forms of selection and adaptive responses that appear to be operating in the population are discussed further. The striking prevalence of DBP1 amplification implies an important adaptive role in the host blood stage of infection in Ethiopia. The amplification has been found at low frequency in Malaysia (4%) and Indonesia (6%), and at higher frequencies in Cambodia (29%), Thailand (31%), and Madagascar (53%) [5,13,36,37]. The prevalence in Ethiopia reached almost , and all monoclonal (F WS > 0.95) samples from Ethiopia plus the 7 major haplotypes derived from polyclonal Ethiopian isolates that were deconvoluted (C) using DEploid identity by descent (IBD) software. Data are presented on 260 982 loci for which derived alleles could be confidently called. The dashed black lines demark the thresholds of -log 10 (P value) > 4: signals supported by a minimum of 3 single-nucleotide polymorphisms (SNPs) above the threshold within 50 kb of one another and with an overall SNP density <10 kb per SNP are numbered. Details can be found in Supplementary Data 4. In brief, the putative genetic drivers include an SPRY domain protein (PVP01_0508000) (signal 1), 3 conserved proteins with unknown function (PVP01_0515000, PVP01_1029200, and PVP01_1455300) (signals 2, 4, and 9 respectively), gamma-glutamylcysteine synthetase (PVP01_0717300), an AP2 domain transcription factor (PVP01_1418100) or ferredoxin (PVP01_1419000) (signal 7), and an amino acid transporter (PVP01_1449600) (signal 8). The driver in the 92-kb region on chromosome 13 (signal 6) remains unclear. *Signals were supported by a minimum of 3 SNPs above the threshold in the population-wide data but only 1 SNP in Ethiopia. 80%, higher than in any population examined to date [5,13,36,37]. The high haplotype diversity in the flanking regions of the DBP1 amplifications implies that it has most likely arisen independently on multiple occasions without being purged from the population. In a previous study in Thailand, a highly prevalent MDR1 amplification that conferred resistance to mefloquine was rapidly purged from the population after the drug was discontinued [24],. The fitness consequence of the DBP1 amplification is less clear. A genetic epidemiology study of DBP1 amplification in Cambodia speculated that it may have   Figure 5. Heatplot illustrating multiple Duffy binding protein 1 (DBP1) haplotypes between samples and divergence between DBP1 copies within samples in monoclonal infections with copy number amplifications. The heat plot presents color-coded genotype calls at single-nucleotide polymorphisms in the DBP1 gene with minor allele frequency ≥1%. Genotypes are presented as reference allele frequencies ranging from 0 in red (homozygote alternative allele) to 1 in blue (homozygote reference allele). Samples are ordered on the y-axis according to their genetic relatedness as per the left-hand phylogram. Sample labels are color-coded according to country. Only monoclonal (within-sample F statistic > 0.95) infections with ≥2 DBP1 copies were included in the analyses. Therefore, heterozygous positions (in orange) reflect differences between the DBP1 copies within a given infection. Two Malagasy-type DBP1 amplifications are labeled; all other amplifications were Cambodian type.
an immune-related role by enhancing antigenic diversity on the RBC membrane [36]. Our assessments in Ethiopia and Thailand also revealed moderately high DBP1 haplotype diversity, with haplotypic differences between copies enriching the diversity in >35% of monoclonal infections. However, the results of a recent functional study imply that the amplification may enhance parasite binding to Duffy-negative reticulocytes using a Duffyindependent invasion pathway [35]. Whether functioning in immune evasion or Duffy-negative reticulocyte invasion, the high prevalence of the DBP1 amplification in Ethiopia warrants close monitoring of this variant across the continent. Antimalarial drug resistance presents another challenge to the containment of vivax in the Horn of Africa. Chloroquine (CQ) remains the first-line treatment for P. vivax blood-stage infection in Ethiopia, but recent surveys have demonstrated declining CQ efficacy against P. vivax, with recurrence ranging from 4% to 22% by day 28 [38][39][40][41]. Although the prevalence of the MDR1 Y976F mutation in Ethiopia (32%) was not as high as that documented in Papua Indonesia or Malaysia, where high-grade CQ resistance (CQR) has been reported [42,43], ongoing surveillance is needed. Furthermore, since MDR1 Y976F mutation is not a prerequisite for high-grade CQR [34], additional markers are needed to facilitate these efforts. Several studies have explored the role of P. vivax CRT, the orthologue of PF3D7_0709000, the primary determinant of CQR in P. falciparum [44]; however, results are inconsistent [45][46][47][48][49]. Interestingly, we found evidence of extended haplotype homozygosity in Ethiopia relative to Thailand in a region upstream of CRT, potentially reflecting selection on a regulator of CRT expression in Ethiopia. While studies in Brazil have reported association between CRT expression and CQR in P. vivax using clinical phenotypes [46,47], the lack of association using ex vivo phenotypes in isolates from Papua Indonesia [45], the epicenter of CQR for this species [50], is not consistent with a pivotal role. However, it is possible that different variants are operating in different populations. Further studies are warranted.
The absence of MDR1 amplification in Ethiopia suggests that mefloquine may be an appropriate treatment alternative for vivax, should CQR continue to rise. However, the high prevalence of DHFR double mutants (94%) raises concerns for pyrimethamine in regimens such as intermittent preventive treatment for pregnant women or infants. Interestingly, in contrast to the Asian populations, there were no triple or quadruple DHFR mutants and <20% prevalence of DHPS resistance-conferring mutants in Ethiopia, reflecting lower selective pressure. Accordingly, the haplotype-based signals of differential selection at the DHFR and DHPS loci reflected directional selection patterns in Thailand and Indonesia, not Ethiopia. It has been postulated that drug resistance variants may emerge more readily in populations with high rates of inbreeding. We did not find evidence of any significant difference in within-infection relatedness between Ethiopia and the Thai or Indonesian populations, implying comparable levels of inbreeding between these populations. Alternatively, differences in drug history or drug usage, which is generally postulated to be higher in Asia than Africa, might have contributed to lower selective pressure in Ethiopia.
Some of the strongest signals of selection in Ethiopia were observed in MSP5 and AMA1. Tajima D analysis demonstrated evidence of balancing selection in AMA1, consistent with high antigenic variation facilitating evasion of host immune pressure on the RBC stage of the parasite. Both AMA1 and MSP5 also exhibited evidence of differential selection between populations, reflecting complex haplotype structures of different strains circulating in different populations.
We also found evidence of adaptive variation in genes involved in regulation of gene expression, including several AP2 domain transcription factors. One of the strongest signals reflected an AP2 transcription factor on chromosome 14 (PVP01_1418100) under directional selection in Ethiopia and Asia. A previous study in Cambodia also reported strong selection on PVP01_1418100 and other transcription factors, highlighting the importance of these regulatory determinants in modulating mechanisms critical to the transmission and maintenance of vivax populations such as gametocytogenesis and hypnozoite dormancy and activation [17].
Although pressure from antimalarial drugs appears to be weaker in Ethiopia than Southeast Asia, the evidence of selection in genes involved in a variety of processes at various parasite life-cycle stages calls for multifaceted intervention approaches to contain P. vivax in the region. Ongoing genetic and genomic surveillance will aid in the detection of new forms of adaptation as they arise in the parasite population.

Supplementary Data
Supplementary materials are available at The Journal of Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.