Whole-Genome Scans Provide Evidence of Adaptive Evolution in Malawian Plasmodium falciparum Isolates

Background Selection by host immunity and antimalarial drugs has driven extensive adaptive evolution in Plasmodium falciparum and continues to produce ever-changing landscapes of genetic variation. Methods We performed whole-genome sequencing of 69 P. falciparum isolates from Malawi and used population genetics approaches to investigate genetic diversity and population structure and identify loci under selection. Results High genetic diversity (π = 2.4 × 10−4), moderately high multiplicity of infection (2.7), and low linkage disequilibrium (500-bp) were observed in Chikhwawa District, Malawi, an area of high malaria transmission. Allele frequency–based tests provided evidence of recent population growth in Malawi and detected potential targets of host immunity and candidate vaccine antigens. Comparison of the sequence variation between isolates from Malawi and those from 5 geographically dispersed countries (Kenya, Burkina Faso, Mali, Cambodia, and Thailand) detected population genetic differences between Africa and Asia, within Southeast Asia, and within Africa. Haplotype-based tests of selection to sequence data from all 6 populations identified signals of directional selection at known drug-resistance loci, including pfcrt, pfdhps, pfmdr1, and pfgch1. Conclusions The sequence variations observed at drug-resistance loci reflect differences in each country's historical use of antimalarial drugs and may be useful in formulating local malaria treatment guidelines.

<5 years old, 34% of outpatient visits by children of all ages, and 40% of hospital deaths. Year-round malaria transmission occurs in almost every part of the country and peaks during the annual rainy season, from December to May [2]. Since 2005-2007, Malawi and its external donors have scaled up malaria control interventions; coverage for long-lasting insecticide-treated bed nets (LLINs) has reached 60%, intensive indoor residual spraying (IRS) has been launched and expanded in districts with high transmission rates, and artemisinin-based combination therapy (ACT) has replaced sulfadoxine-pyrimethamine (SP) as the first-line treatment for malaria [1,3]. However, childhood cases of malaria have not declined since 2001 [3], and the overall prevalence of anemia and parasitemia have not reflected the scaled up access to malaria interventions. Recently, the prevalence of parasitemia in rural children was higher than that in urban children (47% vs 15%), and the prevalence in the central region of Malawi (50%) was higher than that in the southern (42%) and northern (23%) regions (Malawi National Malaria Indicator Survey, 2010, http://files.givewell.org/files/DWDA% 202009/AMF/Malawi_MIS_2010_Final.pdf ). On the other hand, scaled up malaria interventions (eg, LLIN use) have benefitted pregnant women by reducing the prevalence of peripheral parasitemia (from 23.5% to 5.0%) and placental malaria (from 25.2% to 6.8%) [4].
An integrated approach to malaria control is essential, and the application of genomics is one area that can provide biological insights into important evolutionary forces in P. falciparum. An improved understanding of the parasite's genetic diversity, population structure, and natural selection will have practical implications for developing methods of disease control. In particular, genetic variation enables the parasite to overcome host immunity, antimalarial drugs, and vaccines to establish persistent infections and increase transmission [5][6][7]. In this study, we analyzed whole-genome sequence variation in 69 P. falciparum isolates obtained from children in the Chikhwawa district of Malawi to investigate the genomic epidemiology of P. falciparum in this area and to explore the impact of host immunity and antimalarial drugs on the parasite population. We have used allele frequency-based approaches to infer the parasite's demographic history in Malawi and discover genetic loci likely to be under balancing selection [8]. Results from our analysis of genetic diversity, linkage disequilibrium (LD), and multiplicity of infection (MOI) are consistent with the high level of malaria transmission in Chikhwawa District. Comparison of sequence variation in parasite populations from Malawi and Kenya, Burkina Faso, Mali, Cambodia, and Thailand, using haplotype-based tests and population differentiation metrics (F ST ), identified signals of directional selection at known drugresistance loci, including pfcrt, pfdhps, pfmdr1, and pfgch1. These findings highlight potential roles of genomics in guiding malaria control efforts, such as monitoring of key drug biomarkers and informing drug policy, and changes in parasite populations structure that correspond to and can predict changes in malaria epidemiology.

Study Site and Patients
The study was performed in southern Malawi's Chikhwawa District (16°1′ S, 34°47′ E), a rural area of intense perennial malaria transmission (entomological inoculation rate, 183 infective bites/person/year) [9]. It is approximately 70 m above sea level, divided throughout its length by the Shire River, and prone to flooding during the wet season. It has a tropical climate with a mean annual temperature of 26°C, a single wet season from December to May, and an annual rainfall level of approximately 770 mm [9]. With an annual average infection prevalence in 2-10-year-old children that exceeds 40% [10], Chikhwawa District has one of the highest malaria transmission rates and is 1 of 12 sites in Malawi chosen for intensive antimalarial interventions: IRS, extensive LLIN coverage, and ACT. SP is used for intermittent preventive treatment in pregnancy (IPTp) [11].
Permission to conduct the study was granted by ethics committees of the College of Medicine, Malawi and the Liverpool School of Tropical Medicine. Written informed consent was obtained from a parent or guardian of each child.

Sample Collection and Processing
Between December 2010 and July 2011, 93 whole-blood samples were collected from children participating in a clinical trial at Chikhwawa District Hospital. Blood was depleted of leukocytes by CF11 column filtration [12], and genomic DNA was extracted using the QIAamp DNA Blood Midi Kit (Qiagen). Human and P. falciparum DNA levels were quantified using Pi-coGreen analysis and quantitative real-time polymerase chain reaction (PCR), using the Applied Biosystems stepOne RT-PCR system [13]. Samples used for sequencing yielded >50 ng of DNA and had <80% human DNA contamination. Samples were sequenced by the Illumina Genome Analyzer IIx or the Illumina HiSeq 2000, using the manufacturer's recommended protocol [14], with a minimum of 76-bp paired-end reads.
Data Processing: Alignment, Single-Nucleotide Polymorphism (SNP) Discovery, and Quality Filtering Detailed description of the analysis pipeline has been described elsewhere [15,16]. Briefly, short reads for all 93 samples were mapped to the 3D7 reference genome (version 3.0), using SMALT (http://www.sanger.ac.uk/resources/software/smalt) with default parameters, and SNPs were called using SAMtools (http://samtools.sourceforge.net). This process identified 115 965 SNPs across the 93 samples, 24 of which were discarded because of very low coverage (average coverage across the whole genome, <10-fold). For the remaining 69 samples (average coverage across whole genome, >35-fold), we retained 88 655 high-quality SNPs (76.4%) in their nuclear genomes that met the following criteria: (1) biallelic; (2) quality scores of >30 (error rate, <1 per 1000-bp); (3) not in genomic positions at the very extremes of the coverage distribution (sample average coverage, <10-fold or >2000-fold), which could reflect deletions or copy number variants, respectively [17]; (4) not located in subtelomeric regions, the hypervariable var, rifin, and stevor gene families, or regions of low uniqueness; and (5) no SNP positions with ≥3 problematic genotypes (missing or mixed). Uniqueness was calculated by a sliding 54-bp window of contiguous sequence across the 3D7 reference genome and detecting the presence of this motif elsewhere in the genome. Only SNPs that were in unique positions were retained. Genotypes at SNP positions were called using ratios of coverage. Heterozygous calls were converted to the majority genotype if the coverage ratio was 80:20 or greater [15,18], and the resulting majority allele data were used for further analysis. The filtering of samples and SNPs with mixed genotypes minimized any potential effects of multiplicity of infection. Progeny of the HB3 × DD2 cross (n = 25; 35 832 SNPs [19]) and other population data (Kenya, n = 37; Burkina Faso, n = 40; Mali, n = 40; Cambodia, n = 80; Thailand, n = 80; 294 187 SNPs; [16]) were processed in the same way. The 4 mitochondrial SNPs (mt772, mt1692, mt4179, and mt4952) were extracted from the alignments and the genotypes called as described above. Public accession numbers for sequence data are contained in SRA studies (ERP000190 and ERP000199; http://www.malariagen.net).

Population Genetics
For the high-quality SNPs (n = 88 655) in the Malawian population, we estimated the genetic diversity by calculating the average pair-wise nucleotide diversity (π). We used DnaSP [20] software to compute the allele frequency-based Tajima D test [8] and Fu and Li's D and F metrics [21] to identify genes under balancing selection. Results from Fu and Li's D and F metrics correlated highly with those from the Tajima D test (Spearman ρ, 0.85) and were not analyzed further. To detect signals of directional selection, the integrated haplotype score (iHS) [22] was used, and its P values were computed from standardized values based on a 2-tailed conversion from a Gaussian distribution [23]. The MOI in Malawi was estimated using a novel method of counting the unique haplotypes formed by polymorphism on paired sequencing reads [24]. We estimated recombination rate in this population using 2 progeny crosses. First, rates from the 7G8 × GB4 (Ghana × Brazil origin) were accessed [25]. Second, rates from the HB3 × DD2 (Honduras × Indo-China origin) cross [19] were estimated using the R QTL library (http://www.rqtl.org). For comparisons between populations (Malawi vs others, n = 294 187 SNPs), we first applied the principal component analysis (PCA), based on a matrix of pair-wise identity by state values, followed by the cross-population long-range LD method, XP-EHH [26] and population differentiation metric F ST [27].
P values for the XP-EHH estimates were calculated using a Gaussian approximation. A significance threshold of P < .0006 was established for both iHS and XP-EHH, using a simulation approach. We used the ranked F ST statistics to identify the informative polymorphism driving the clustering in the PCA. LD was measured using the r 2 metric [28], calculated for pairs of SNPs with different physical separation up to 10-kb, using a sliding window approach. The R statistical package was used to analyze the results.

Genetic Diversity and Signatures of Selection Within the Malawian Population
We identified 88 655 high-quality biallelic SNPs (approximately 1 SNP per 260 bp, compared with approximately 1 per 266 bp among isolates worldwide [18]) in robust genomic regions of the 69 samples with good sequence coverage. The overall π in Malawian samples (0.00025; 95% confidence interval, .00023-.00026) is consistent with estimates of high genetic diversity observed in African samples, especially in regions of high transmission [5,23,29]. We observed high variability in π across chromosomes (minimum, 0.00011, chromosome 3; maximum, 0.00037, chromosome 4), probably because of variation in recombination rates. Analysis of antigenic regions of extremely high diversity (200 genes; average π, 0.00050) led to a minor decrease in diversity (average π, 0.00023; P = .035, by the Wilcoxon signed rank test on differences). Regions with elevated diversity may encode an antigenic or polymorphic locus that may be useful for vaccine approaches. Modest positive correlations were observed between recombination rate and diversity (Spearman ρ, r = 0.075 for 7G8 × GB4; r = 0.101 for HB3 × DD2).
To examine the potential effects of recombination, we looked at levels of LD in the Malawian population, which decayed rapidly within a few 100 bp and reached a baseline level within 500 bp. The decay in LD was more rapid in Malawian parasites than in Asian parasites, as previously reported (MAF, 10%; Figure 1) [16,18]. The lower LD in Malawian samples suggests high levels of outcrossing (and effective recombination), as expected in areas of relatively high transmission intensity [23]. Indeed, a high EIR (183 infective bites/person/year) has previously been described for this Malawian population [9], consistent with a previous report of high estimated MOI (median, 2.7; range, 1-10) [30], and was higher than 2 Southeast Asian populations (medians, <1.5) [24]. In this latter study, <10% of samples had multiple infections, compared with approximately 50% of African samples, also reflecting differing transmission patterns [24]. Further examination of LD identified 40 genomic regions (containing at least 5 SNPs) with high average r 2 values (>0.5) that extended beyond 1 kb (Supplementary Table 1). We found no evidence of low recombination rates in these regions, and none of the genes were in the lowest 10th percentile of the recombination rates distribution for either genetic cross. One particular gene, piesp2 (PF3D7_0501200, which encodes a parasite-infected erythrocyte surface protein), showed a pattern of LD extending to 1 kb, suggesting that this region is under recent positive selection.
Computation of iHS for the Malawian SNPs identified 14 chromosomal loci likely to be under positive directional selection (P < .0006; Table 1 and Figure 2). These include regions surrounding 2 SP resistance loci ( pfdhps and pfgch1) that extend to several genes, suggesting that drug selection has produced chromosomal segments of selective sweeps. We did not detect selection signals at pfcrt, but there was evidence of selection within 3 kb of the gene. This may be expected with regression back to the pfcrt wild-type alleles after withdrawal of chloroquine (CQ) from Malawi >15 years ago. Signals were also detected within 10 kb of pfubp1, a homologue of a Plasmodium chabaudi gene linked to artemisinin resistance. In Kenya, pfubp1 alleles were recently found to be under directional selection and associated with reduced in vitro susceptibility to artemisinin [31]. Positive directional selection signals were also evident in msp6 and msp3.8, pfama1, trap, and msp7 on chromosomes 10, 11, 13, and 14, respectively. These genes are expressed predominantly in merozoites and are thought to be  primarily under balancing selection, consistent with other observations in Asian and African populations [31][32][33]. These nondrug-related drivers of directional selection, as well as antigenic loci that modulate drug resistance (eg, msp3.8 and members of the clag gene family [34,35]), require further investigation. As expected, Malawian samples have the African-specific K189T mutation in PF3D7_1343700 but none of the so-called K13propeller mutations (eg, C580Y, R539T, or Y493H) associated with artemisinin resistance in Cambodia [36].
Interrogation of the allele frequency spectrum of different classes of nucleotide sites showed an excess of rare alleles (Supplementary Figure 1), with coding, nonsynonymous,   [23,37]. Use of the allele frequency-based Tajima D test to evaluate each polymorphic gene across the genome identified regions likely to be under balancing selection. There was a strong positive correlation between π and Tajima D values (Spearman ρ, 0.532), providing potential evidence of increasing diversity with balancing selection. For the 2073 genes (51.1%) with at least 5 SNPs computed for Tajima D, the majority of values (96.5%) were negative. To negate any confounding effects of population expansion and difficulties in establishing significance levels, we report 19 genes with Tajima D values of >1 (Table 2). These include a significant overrepresentation of genes (msp3.8, msp3, dbl-msp, eba-175, ama1, and surfin4.2) involved in merozoite invasion of erythrocytes and previously reported to be under immune selection [31,37,38]. The identification of additional genes potentially under immune selection may suggest additional proteins as candidate vaccine antigens.

Comparison of the Malawian Population to Other Populations
Of the 294 187 high-quality SNPs identified across the 6 populations (Malawi, 46% of SNP sites observed; Kenya, 30%; Burkina Faso, 33%; Mali, 37%; Cambodia, 22%; and Thailand, 23%), only 8% were polymorphic in Malawi. The comparison of Malawian samples to samples from each of the 5 populations, using the cross population long-range haplotype method (XP-EHH), identified regions potentially under positive directional selection at or near known drug-resistance loci ( pfdhps, pfcrt, and pfgch1; Table 3). Although low recombination rates may confound the directional selection interpretation, a follow-up analysis of genetic cross-progeny showed that these identified regions do not have low recombination rates. In Malawi, directional selection at pfdhps is probably due to high SP pressure, while pfcrt selection in Burkina Faso, Mali, Cambodia, and Thailand is likely due to high CQ pressure. Signals detected at pfcrt and pfdhps in Kenya and Thailand are also indicative of CQ and SP selection, respectively. The lack of evidence for selection at pfcrt in Malawi reflects the withdrawal of CQ and subsequent increase in the ancestral CQ-susceptible allele frequency, due to the re-expansion of a persistent minority population of CQ-susceptible parasites. This observation suggests that parasites carrying ancestral pfcrt alleles have greater fitness in the absence of CQ pressure [39].
The observed selective sweep surrounding the GTP cyclohydrolase gene (pfgch1; PF3D7_1224000) on chromosome 12 is unique to Malawi in this study. The pfgch1 gene is the first gene in the folate biosynthesis pathway, and adaptive selection could have resulted from SP pressure [40]. A similar phenomenon was previously observed in Thai parasites that evolved reduced microsatellite diversity and increased LD flanking the pfgch1 locus during SP selection [40]. Further work, such as analysis of copy number variation, may provide better insights into the selection processes at work at this locus, as positive selection may result from rapid spread of chromosomes carrying multiple copies of pfgch1 [40]. Selection signals in the trap gene are thought to reflect genetic adaptation to divergent host ligands [33] involved in the motility of sporozoites and their invasion of hepatocytes and mosquito salivary glands [41].
The populations of P. falciparum are geographically structured, resulting from adaptation to different environments and selection pressures [16,18]. Prior to analyzing sequence data for population structure, SNPs in the mitochondrial genome (mt; approximately 6 kb) were used to confirm that the Malawian samples were of African origin. Haplotypes were formed using 4 established continent-specific polymorphisms (mt772, mt1692, mt4179, and mt4952) [42]. Two haplotypes of African origin were present: CGCC (identical to 3D7) and CACC, in 90.8% and 9.2% of samples, respectively. High read-depth coverage in mt (median and mean, approximately 1560-fold and 1245-fold, which were approximately 19-fold and 23-fold greater, respectively, than the nuclear genome) was consistent with the known multiple copies of the organelle in a P. falciparum cell [43]. There was no obvious clustering of Malawian SNPs (data not shown), probably because all samples were obtained in the same season and district. A PCA of the 294 187 SNPs from all 6 populations revealed expected differences between Africa and Asia, within Southeast Asia, and within Africa (Figure 3), as previously reported [16]. We further applied the SNP-wise F ST metric to measure genomic divergence across the 6 populations. At a stringent genome-wide cutoff (F ST ≥ 0.2; top 0.5% overall), we identified the most divergent loci between Malawian samples and the other 5 populations. The frequency of alleles encoding known drug targets and their divergence between populations is shown in Supplementary Table 2 and Table 4.
The interpopulation differences at drug resistance loci likely reflect local historical parasite adaptation to drug pressure, leading to fixation or near fixation of the implicated resistance alleles, as evidenced by the following observations. First, reductions in the prevalence of pfcrt-K76T alleles after CQ withdrawal differs between Malawi (0%) and Kenya (31%), a disparity that has been observed previously [44,45]. The return of CQsusceptible malaria to Malawi has prompted discussions of the possibility that CQ, once rendered ineffective, may become useful again [46]. Second, the prevalence of pfcrt-K76T alleles differs between Mali (62%) and Burkina Faso (36%), where CQ is still used. Third, pfcrt-K76T alleles have reached fixation in Cambodia and Thailand; indeed, CQ remains the first-line treatment for Plasmodium vivax malaria in these 2 countries and thus may continue to select for the resistant genotype in P. falciparum [47]. Fourth, pfdhps-K540E alleles have reached in Cambodia, 0.261; and in Thailand, 0.264), suggesting that some signals of directional selection were detected using the population differentiation approach. However, it is possible that extreme XP-EHH could reflect demographic characteristics rather than selection pressure. Finally, a comparison of our data with those from a recent analysis from West Africa (Gambia and Guinea) highlights several key points about the P. falciparum population structure in Africa [48]. First, the predominantly negative Tajima D values in the African (Guinean and Malawian) populations indicate a historical population expansion in Africa. Second, balancing selection acts on a similar set of genes (including those predominantly expressed in merozoites and involved in erythrocyte invasion) in Guinean and Malawian samples, suggesting that it acts on similar antigenic targets irrespective of differing population demographic characteristics. For example, msp3.8 has the highest Tajima D value in both Malawi and Guinea. Third, signatures of directional selection at drug resistance loci appear to be population specific as expected, reflecting historical differences in antimalarial drug use. For example, there are strong signatures of directional selection around pfcrt in Guinea (where CQ was used until 2006) but not in Malawi (where CQ was withdrawn in 1998). In contrast, there is evidence of strong selection in pfdhps in Malawi but only weak selection in Guinea, where SP was never introduced as first-line treatment for malaria [48].

CONCLUSION
Our study demonstrates how the large and growing number of P. falciparum whole-genome sequences can be used to understand malaria biology and impact disease control. Estimates of parasite genetic diversity, LD, and MOI will improve our understanding of various aspects of malaria epidemiology, including infection dynamics, transmission levels, pathogenesis mechanisms, and drug efficacy. For example, the high nucleotide diversity, short-range LD, and MOI found in Malawian samples reflect the high transmission history of the district in which they were collected. Additionally, we have provided insights into the parasite population structure within Malawi and placed it in the context of other populations, facilitating the recognition of potentially imported cases into the country and targeting of effective malaria control measures. Our whole-genome analytical approach has detected evolutionary genetic signatures and genes under selective pressure due to drug resistance or virulence. For example, we observed differences at drug resistance loci between geographically distinct populations due to differing histories of antimalarial drug use. However, we also observed evidence of selection pressure in genes with unknown functions, which can be investigated in future experiments.

Supplementary Data
Supplementary materials are available at The Journal of Infectious Diseases online (http://jid.oxfordjournals.org). Supplementary materials consist of data provided by the author that are published to benefit the reader. The posted materials are not copyedited. The contents of all supplementary data are the sole responsibility of the authors. Questions or messages regarding errors should be addressed to the author.