Optimized Whole-Genome Amplification Strategy for Extremely AT-Biased Template

Pathogen genome sequencing directly from clinical samples is quickly gaining importance in genetic and medical research studies. However, low DNA yield from blood-borne pathogens is often a limiting factor. The problem worsens in extremely base-biased genomes such as the AT-rich Plasmodium falciparum. We present a strategy for whole-genome amplification (WGA) of low-yield samples from P. falciparum prior to short-read sequencing. We have developed WGA conditions that incorporate tetramethylammonium chloride for improved amplification and coverage of AT-rich regions of the genome. We show that this method reduces amplification bias and chimera formation. Our data show that this method is suitable for as low as 10 pg input DNA, and offers the possibility of sequencing the parasite genome from small blood samples.


Introduction
Timely detection of emerging genetic variants and other evolutionary features associated with important clinical phenotypes such as increased virulence and drug resistance are central to malaria control strategies. Genome sequencing of parasite populations has been identified as an effective tool for detecting genetic changes. 1,2 Despite the current success in the sequencing technology, there remain significant challenges in achieving global genetic surveillance of parasite populations in the field. Most genome-scale analyses, such as whole-genome sequencing, require large amounts of clean genetic material that is often difficult to obtain, 3 and therefore a serious impediment to genetic analysis on many clinical samples. A large number of valuable clinical specimens are collected in the form of small samples that yield low quantity and quality of genetic material. 4 -7 A common method for collecting clinical samples in the field is through heel/finger-pricks. 5,7 -10 However, the quantity and quality of parasite genetic material that can be extracted from these small blood samples usually fall below the threshold required by genome sequencing platforms.
To alleviate the problem of low DNA quantities, whole-genome amplification (WGA) is now routinely applied in many applications, 3,11 but has yet to be optimized for use in genomes of extreme base composition such as Plasmodium falciparum. Two major forms of WGA have been described: multiple displacement amplification (MDA) 12,13 and PCR-based amplification methods. 14,15 MDA has been the method of choice for a wider range of genome amplification studies, because it produces longer DNA products with extensive genome coverage. 16 MDA is based on w29 polymerase, which, in the presence of random hexamers annealed to denatured DNA, uses an MDA mechanism to synthesize high-molecular-weight DNA from very minute amounts of input material under isothermal conditions. 17,18 The best results, however, have been obtained from genomes with relatively balanced base composition. 11,19,20 Amplification of genomes with imbalanced base composition, such as the AT-rich P. falciparum, has remained a challenge. 21,22 In this study, we sought to identify and optimize a WGA system suitable for an AT-base-biased genome of P. falciparum. Using standard conditions as outlined for each system, we tested the efficiency of non-MDAand MDA-based methods. Initial findings showed that MDA-based systems produced a more uniform genome coverage than non-MDA methods (data not shown). We have optimized an identified MDA system to produce an improved genome coverage and a reduced base-bias with more accurate genome representation. We show that our optimized WGA conditions are suitable for as low as 10 picograms ( pg) P. falciparum input DNA, producing high-sequence concordance with unamplified genomic DNA. This development promises a significant tool to aid implementation of the global genetic surveillance of parasite populations through small blood sample sequencing.

DNA samples
Plasmodium falciparum 3D7 genomic DNA was a gift from Prof. Chris Newbold (University of Oxford). The clinical isolates were obtained from the Malaria Genetics Group's Sequencing Sample Repository at the Wellcome Trust Sanger Institute. Other genomic DNA was extracted from 17 progeny clones of P. falciparum strains derived from genetic cross between 7G8xGB4 23 and a 3D7 strain (3D7_glasgow).

Whole-genome amplification
All non-MDA WGA were performed following individual kit manufacturer's instructions. MDA-based WGA was performed using either REPLI-g Mini kit (Qiagen) or Genomiphi kit (GE Healthcare). For Genomiphi, the kit manufacturer's instructions were followed without modification. For the REPLI-g Mini kit, manufacturer's instructions were followed during preliminary tests. The following modifications were performed in developing optimized conditions for the REPLI-g Mini kit: nuclease-free water and all tubes were UV-treated before use. WGA reactions were performed in 0.2 ml PCR tubes. Buffer D1 stock solution (Qiagen) was reconstituted by adding 500 ml of nuclease-free water, and a working solution was prepared by mixing the stock solution and nuclease-free water in the ratio of 1 : 3.5, respectively. Unmodified Buffer N1 was reconstituted by mixing Stop solution (Qiagen) and nuclease-free water in the ratio of 1 : 5.7. Modified buffer N1 was prepared by including tetramethylammonium chloride (TMAC) at a concentration of 300 mM. To denature DNA templates, 5 ml of the DNA solution was mixed with 5 ml of buffer D1 (working solution prepared as described above). The mixture was vortexed and centrifuged briefly before incubating at room temperature for 3 min. Denatured DNA was neutralized by adding 10 ml of either unmodified or modified buffer N1. Neutralized DNA was mixed by vortexing and centrifuged briefly. To amplify the DNA template, denatured and neutralized sample was mixed with 29 ml of REPLI-g Mini Reaction Buffer and 1 ml of REPLI-g Mini DNA polymerase to obtain a final reaction volume of 50 ml. The reaction mixture was incubated at 308C for 16 h using an MJ Research PTC-225 thermal cycling system (GMI, Inc., USA) with the heating lid set to track at þ58C. Amplified DNA was cleaned using Agencourt Ampure XP beads (Beckman Coulter) using sample to beads ratio of 1 : 1 and eluted with 50 ml of EB (Qiagen).

Illumina library preparation and sequencing
All sequencing libraries were prepared as PCR-free. Whole-genome amplified or unamplified genomic DNA (1.5 mg in 75 ml of TE buffer) was sheared using a Covaris S2 (Covaris, Inc., Woburn, MA, USA) to obtain a fragment-size distribution of 300 to 600 bp. The sheared DNA fragments were end-repaired and Atailed using the NEBNext DNA sample preparation kit (NEB), following an Illumina sample preparation protocol. Pre-annealed paired-end Illumina PCR-free adapters were ligated to the A-tailed fragments in a 50-ml reaction containing 10 ml of DNA sample, 1Â Quick T4 DNA ligase buffer, 10 ml of PCR-free PE-adapter mixture, 5 ml of Quick T4 DNA ligase (NEB) and incubated at 208C for 30 min. The ligation reaction was cleaned twice using Agencourt Ampure XP beads (Beckman Coulter). Cleaned DNA was eluted with 20 ml of buffer EB. Aliquots were analysed using an Agilent 2100 Bioanalyzer (Agilent Technologies) to determine the size distribution and to check for adapter contamination. Samples were sequenced using either Illumina Hiseq 2500 or Miseq technologies (San Diego, CA, USA) with 75 bp read length and the paired-end read options. Corresponding WGA and non-WGA samples were run in the same lanes, using different multiplex tags. This strategy reduces potential confounding artefacts relating to sequencing chemistry.

Read mapping and genotype concordance analysis
Reads were mapped against the P. falciparum reference sequence (http://plasmodb.org/common/downloads/ release-10.0/Pfalciparum3D7/fasta/data/), using BWA (V0.6.2). We analysed the genotype calls from 17 For in silico genotyping, we used a list of 20 737 highquality single-nucleotide polymorphism (SNP) positions and alleles generated from the sequence of genetic crosses (unpublished data). We performed in silico genotyping of both the WGA and non-WGA samples using samtools mpileup and counting alleles present in at least five reads.

WGA using standard methods
We performed tests on various WGA systemsto identify those suitable for AT-rich P. falciparum genome. We compared non-MDA-Endcore-Rapid (NuGen), MALBAC (Yikon Genomics), Rapisome (Biohelix) and MDA-Repli_g (Qiagen) and Genomiphi (GE Healthcare) systems. Our preliminary data (not shown) indicated that non-MDA methods were less tolerant to the ATbiased genome compared with their MDA counterparts. With non-MDA amplification, we observed excess bias towards regions with high GC content and poor coverage of high ATregions of the genome resulting into high levels of allele dropout. Further optimization focused on the MDA systems. Repli_g and Genomiphi are the most commonly used methods for MDA. Although both kits have shown similar results in many systems, 24,25 the majority of studies have been with genomes of relatively balanced base composition. Here, we have assessed the performance of Repli_g and Genomiphi by amplifying an ATrich P. falciparum genome. Amplification of pure P. falciparum 3D7 genomic DNA by both kits produced very similar results as assessed by genome coverage analysis metrics. However, amplification of host-contaminated clinical samples produced dissimilar results in terms of base composition and overall genome coverage ( Fig. 1). Whereas amplification products generated by both Repli_g and Genomiphi did not maintain the exact original host/parasite DNA proportions, Genomiphi products were more biased towards the host DNA ( Fig. 1), an observation suggesting bias towards templates of neutral base composition originating from the host genome. Based on the level of bias, we chose to optimize Repli_g for amplification of the P. falciparum genome.

Optimized WGA conditions reduce amplification
bias and improve coverage on low complexity regions Non-coding regions of P. falciparum DNA contain 90% AþT base composition. Amplification of AT-rich genomes is a challenge to almost all commercially available polymerases. We have previously shown that addition of TMAC improves coverage of low GC regions of the genome during PCR, 22 but the same has not been tested with w29, the MDA polymerase. To investigate the effect of TMAC on MDA, we amplified P. falciparum 3D7 genomic DNA of varied input amounts (0.1-2 ng) both in the presence and absence of TMAC. We compared the quantity and quality of both products and observed that, like many commercial PCR polymerases, w29 is inhibited by TMAC at certain levels of concentration. We have determined 60 mM as a concentration that is non-inhibitory to the polymerase and optimal for WGA of an AT-biased genome. Although the quantity of the amplification product was higher in the absence (standard procedure) than in the presence of 60 mM TMAC (data not shown), the quality of MDA product, in terms of coverage and base composition, was improved in the optimized procedure where TMAC was added (Figs 2 and 3). As shown in Fig. 2, amplification using a standard protocol (Std) resulted in excessive bias in regions of low complexity. Inspection of the over-amplified regions reveals sequences of low complexity and numerous repeat patterns. We used a tandem repeat finder programme 26 and revealed numerous sequences in tandem repeat conformation that may have affected amplification bias. The top three tandem repeats are provided in Supplementary Table S1. Unlike the standard WGA protocol, our optimized amplification procedure abolished excessive amplification bias and produced a more uniform coverage. The mechanism by which this bias is corrected is not clear, but it is conceivable that TMAC stabilizes and stiffens the DNA backbone, thereby minimizing cis-priming by the looping of displaced DNA strands during amplification. We also show that base composition bias increased with a decrease in the amount of input DNA during amplification with standard protocols. This bias was not observed in samples amplified with the optimized (Opt) conditions (Fig. 3).

3.3.
Optimized conditions reduce chimera formation and maintain template base composition during MDA by w29 Formation of chimeric reads is a major problem with MDA technology. 27 Chimeras cause serious mapping and assembly problems, and therefore reduce the quality and quantity of the WGA product. We analysed the formation of chimeras by comparing WGA products following standard and optimized procedures. The number of chimeric reads increased as the amount of input DNA was reduced (Fig. 4). However, optimizing WGA condition by including TMAC additive decreased the formation of chimeras and improved the quality of reads in P. falciparum WGA.
3.4. WGA from 10 pg of input P. falciparum genomic DNA Most studies with w29 MDA have used an input DNA of !10 ng for WGA. 19,20 In many cases, this amount may be difficult to obtain from valuable clinical specimens. We set to find out the minimal amount of parasite DNA that can be successfully amplified by the optimized conditions to obtain uniform genome coverage. We performed WGA on P. falciparum genomic DNA with an input amount ranging from 2 ng down to 100 femtograms (fg). WGA products were multiplexed and sequenced using Illumina MiSeq or HiSeq 2500 machines. Sequence reads generated were analysed to determine the minimum threshold required to produce optimal coverage suitable for various whole-genome studies including genotyping by SNP analysis. To assess the quality of the sequence data generated from each input amount, reads were mapped to the reference genome using BWA. CallableLoci program of the genome analyser tool kit (GATK) 28 was used to inspect and count the proportion of the genome with high-quality base coverage (callable  Whole-Genome Amplification for AT-Rich Templates [Vol. 21, bases), positions of the genome with zero coverage (uncovered bases) and the size of coverage gaps. Using these metrics, we show that the number of callable loci (high-quality bases) remained relatively high for samples with input DNA ranging from 2 ng to 10 pg. However, the quality of sequence data dropped sharply for samples with input DNA ,10 pg (Fig. 5). A similar trend was observed with the proportion of gap sizes (length of uncovered bases, Fig. 5 bottom panel) and chimeric reads (Fig. 4), where a sharp increase in these values was observed for samples with input DNA ,10 pg. From these observations, we concluded that 10 pg is the minimal amount of P. falciparum input DNA that produces quality genome coverage under the optimized MDA conditions described. This amount of DNA is equivalent to only 380 parasite genomes.
3.5. Detailed analysis of WGA products from a 10-pg input template DNA Standard WGA methods produce products that show sequence representation bias, allelic dropout and amplification artefacts. These problems tend to increase as the amount of input template decreases. To adequately genomic DNA (ranging from 0.1 to 2 ng as shown in sample names) were used as an input template for amplification by Repli_g following the standard or optimized procedure. Amplification products were sequenced as PCR-free and reads obtained were analysed for GþC content profile. A non-WGA sample (gDNA) shows a GC content of 19.4%, a profile that was closely matched by products amplified using the optimized procedure (Opt). Samples amplified following the standard procedure (Std) showed biased GC content shown as a shift towards the neutral base composition. falciparum 3D7 genomic DNA (ranging from 2 ng to 100 fg as shown in sample labels) were used as an input template for amplification by Repli_g following the standard or optimized procedure. Amplification products were sequenced as PCR-free, and reads obtained were normalized and analysed for the presence of chimeric reads. A non-WGA sample (control) showed the least number of chimers. Using the standard procedure, the number of chimers increased with a decrease in the amount of input DNA. Samples amplified using the optimized procedure showed a decrease in chimer formation that remained low and steady from 2 ng to 100 pg of input DNA. Figure 5. Determining the lowest threshold of input DNA mass for WGA. Different amounts of P. falciparum 3D7 genomic DNA (ranging from 2 ng to 100 fg, as shown in sample names) were used as an input template for amplification by Repli_g following the standard or optimized procedure. Samples were multiplexed and sequenced using a fast turn-around Illumina Miseq machine. Sequence reads mapped to the reference were normalized and analysed for coverage and base quality using the 'CallableLoci' program of GATK. A non-WGA sample was used as an unamplified control. Input quantities ranging from 2 ng down to 10 pg produced reads with high-quality 'callable' bases covering over 60% of the genome. Input DNA below the10-pg threshold produced poor base quality with only ,30% of genome covered with 'callable' bases. Input DNA ,10 pg showed a sharp increase in positions with zero coverage (bases uncovered) and an increase in gap sizes in genome coverage. evaluate the quality of the WGA products produced from an input of 10 pg-1,000-fold smaller than the standard input amounts-we performed analysis on genomic DNA extracted from P. falciparum strains derived from the progeny clones of genetic cross between 7G8xGB4 laboratory strains, as well as the 3D7 Strain. Genomic DNA extracted from 17 progeny clones was sequenced both as WGA and as bulk genomic DNA (non-WGA). We performed detailed analysis by comparing sequence data generated from WGA and their matching unamplified genomic DNA (non-WGA). WGA and non-WGA samples yielded a median of 3.5 and 3.4 billion base sequences, respectively, with 94.8 and 95.3% of the read mapping to the reference sequence (Table 1). WGA samples showed coverage between 180Â and 500Â, whereas non-WGA samples showed coverage between 90Â and 250Â. A median of 1.5 and 1.2% of genome bases was not covered in WGA and non-WGA samples, respectively (Table 1).

Genotype concordance analysis
To evaluate the fidelity of sequence representation by WGA, we analysed the genotype calls from 17 cross samples and the parent reference strain. We performed de novo and in silico genotype calls on sequence generated from both the WGA and non-WGA samples. Genotype concordance was determined by comparing SNP and InDel calls from matched pairs of both non-WGA and WGA samples.

Concordance of de novo SNPs and InDel
calls For each pair of WGA and non-WGA sample, we generated de novo variation calls simultaneously. A median of 20,338 biallelic de novo SNPs with a quality score of !250 were called for each sample. For the 3D7 reference strain (3D7_Glasgow), only 185 de novo SNPs were called (Supplementary Table  S2). Both of these numbers are well within the expected range for this method. 1 As given in Table 2, call pairs were grouped into 'Perfect Concordance', 'WGA New Alleles' and 'Undetermined'. Identical calls in both non-WGA and WGA samples had a median of 97.3% and a mean of 96.7% + 1.2 that are in perfect concordance, and new alleles present only in the WGA samples (considered as WGA-introduced alleles) had a median of 1.2% and a mean of 1.8 + 1.1%. Most of the WGAintroduced alleles were due to a mixed call in WGA, where the non-WGA counterpart only showed the reference allele. Discordant single-allele calls were extremely rare with a median of zero SNPs and a total  Table S2a). Additionally, de novo SNP calls in 5,075,789 well-covered coding positions (Table 2) Table S2b). Calls that were grouped as 'Undetermined' represent alleles that were missing or wrong alleles ( present in the non-WGA samples but are neither the reference nor the alternative allele).
As with SNPs, we also performed biallelic de novo InDel calls with a quality score of !250 for both WGA and non-WGA samples. As given in Table 3, InDel calls from the whole genome with perfect concordance between WGA and non-WGA samples had a median of 97.6% and a mean of 96.8 + 1.7%, and WGA-introduced InDels had a median of 1.5% and a mean of 2.4 + 1.8%. Similarly, de novo InDel calls in the coding regions only produced near-perfect concordance with a median of 98.7% and a mean of 98.2 + 1.4%. WGA-introduced InDels in these regions dropped to a median of 0.4% and a mean of 1.4 + 1.3%.
3.6.2. Concordance of in silico genotyping Using a list of high-quality SNP positions and alleles from the genetic crosses, we performed in silico genotyping of both the WGA and non-WGA samples on 20,737 positions, counting alleles present in at least five reads. Call comparisons between the WGA and non-WGA from same samples were grouped into 'Perfect Concordance (identical)', 'WGA Missing Alleles', 'WGA New Alleles' and 'Undetermined' (missing allele). Identical alleles with perfect concordance between WGA and non-WGA samples had a median of 97.95% (Fig. 6). A median of 0.48% calls represent alleles that were called in the WGA, but not in the non-WGA, samples. These reflect cases where the WGA sample had a mixed call and the non-WGA sample had a single-allele call. A median of 0.94% represent alleles that were missing in WGA, but were called in non-WGA samples. Discordant single-allele calls were extremely rare, with a median of zero and a mean of  Table S4).

Discussion
High-throughput DNA sequencing technologies have gained a wide range of applications in infectious disease research, including global surveillance of the emergence and spread of drug resistance, detection of regions of the genome under selection and identification of genetic determinants of clinical phenotypes through genomewide association studies. The P. falciparum genome presents inherent technical challenges to next-generation DNA sequencing, such as its extreme AT-richness. Whole-genome sequencing of P. falciparum field isolates poses additional difficulties due to the small quantity of DNA often retrieved from patients in the clinical setting and the high degree of host DNA contamination from leucocytes. This study addressed the first of these key problems by developing a method for WGA capable of producing high-quality sequence data from just 10 pg of input P. falciparum DNA.
Although PCR-based WGA techniques, such as primer extension PCR, ligation-mediated PCR and degenerate oligonucleotide-primed PCR, have been successfully used in some studies such as single-cell amplification, 29 their wider application has been limited. PCR-based WGA methods produce relatively shorter products, nonspecific amplification artefacts and incomplete genome coverage. 19,20,30 MDA has been associated with base-bias and generation of chimeric products. 29 Nonetheless, MDA has been the method of choice for a wider range of genome amplification studies, because it produces longer DNA products with extensive genome coverage. MDA also produces higher DNA yields with relatively less amplification bias. 3,11,19 Here, we have assessed MDA on an AT-rich genome using a range of input DNA quantities. We describe an optimized WGA method incorporating TMAC reagent that improves amplification coverage of the difficult AT-rich loci of the genome. We establish 10 pg of input DNA as the lowest threshold from which our optimized WGA protocol generates an amplification product with optimal P. falciparum genome coverage for most genome sequencing analysis. This amount equates to 380 parasite genomes, equivalent to 1 ml of blood in a patient with 0.01% parasitemia. Furthermore, we show that the optimized conditions significantly reduce the formation of chimeric reads, thereby improving the overall quality of the amplified product.
Standard WGA from such small quantities of input material is often associated with bias and incomplete genome coverage. A single-cell genomic approach that uses infected red blood cell sorting technology has recently been reported that achieved a genome coverage of 50% with a standard WGA method. 31 Although genome coverage in single-cell approach is still low, this technology has opened avenues for single-cell genomics studies in malaria and offers great opportunities for dissecting multiple genotype infection. Our optimized WGA method described here will be useful for optimizing genome coverage in such single-cell genome amplifications, as well as direct field applications for small sample sequencing. For improved malaria clinical sequencing, we routinely employ the combination of host depletion methods 32,33 and the current optimized WGA procedure to generate high-quality whole genome sequencing data.
We have assessed the quality of the amplified products generated from 10 pg input DNA using high-throughput sequencing of DNA extracted from 17 progeny clones derived from genetic crosses between two laboratory strains. We performed a comparative analysis between the WGA data and their corresponding non-WGA counterparts. We show that coverage was sufficient for allele calling and other whole-genome analyses. Although the number of uncovered bases was slightly higher in WGA (median, 1.5%) than non-WGA (median, 1.2%) samples (Table 1), the difference was insignificant (t-test, P ¼ 0.053, 95% CI). The same applies to preference for mitochondria DNA amplification, which showed a median of 2.6% in WGA against 0.6% in the non-WGA samples.
Another key aspect of P. falciparum whole-genome sequencing from clinical specimens is removing host DNA contamination. This can be achieved either at the blood sample processing stage through leucocyte depletion or through selective enrichment of parasite DNA after extraction. 32 -34 The combination of an effective method for removing human DNA that is applicable to the field setting, and the ability to perform whole-genome sequencing from very low quantities of input DNA as described in this study, has the potential to greatly increase the scope and scale of P. falciparum genomic research. This, in turn, would contribute significantly to malaria genetic surveillance and control strategies.

Conclusion
The optimized amplification conditions described here have generated high-quality whole-genome sequence data (99.8% genome coverage) from a minute amount of input DNA, equivalent to ,400 P. falciparum genomes. This work shows for the first time that accurate in silico genotyping and de novo calling of genetic variants is achievable on a WGA sample using ,1 ng of input DNA from an extremely AT-rich genome. We anticipate that sequencing from small quantities of input DNA (,1 ng) will become a significant aid to genetic and genomic studies of P. falciparum in the field, particularly when combined with effective methods for removal of host DNA contamination.