Quantifying Replication Slippage Error in Cryptosporidium Metabarcoding Studies

Abstract Genetic variation in Cryptosporidium, a common protozoan gut parasite in humans, is often based on marker genes containing trinucleotide repeats, which differentiate subtypes and track outbreaks. However, repeat regions have high replication slippage rates, making it difficult to discern biological diversity from error. Here, we synthesized Cryptosporidium DNA in clonal plasmid vectors, amplified them in different mock community ratios, and sequenced them using next-generation sequencing to determine the rate of replication slippage with dada2. Our results indicate that slippage rates increase with the length of the repeat region and can contribute to error rates of up to 20%.

Replication slippage, also known as slipped strand mispairing, is a natural mutation process that occurs during DNA replication.It involves the misalignment of DNA strands, leading to the addition or deletion of a repeated sequence of nucleotides in a specific region of the DNA [1].This phenomenon is particularly common in repetitive DNA sequences, such as microsatellites or short tandem repeats (STRs), which are stretches of DNA where a short sequence of nucleotides is repeated multiple times in a row.This phenomenon has implications for genetic diversity, evolution, and the development of certain genetic diseases [2] and can be measured by amplification and sequencing of the repeat region.
The most commonly used marker for characterizing diversity in intestinal protozoan parasite Cryptosporidium is the 60 kDa glycoprotein (gp60) gene, which contains nucleotide sequence differences, trinucleotide repeats, and other repeat regions [3].The trinucleotide repeats encode serines that are used to classify subtypes within subtype families.Sanger sequencing is sufficient for identifying the dominant subtype family in a sample, but recent studies using next-generation sequencing (NGS) have revealed hitherto unrecognized diversity in Cryptosporidium infections and provided valuable insights into the epidemiology and biology of these parasites [4,5].Outbreaks seldomly appear to be caused by a single Cryptosporidium subtype, and host coinfection with multiple subtype families or even by different species appears to be possible [6].However, amplification of short repeats in gp60 amplicons is susceptible to DNA polymerase slippage during polymerase chain reaction (PCR) elongation steps [7].This makes it difficult to distinguish true diversity (and possible rare or emerging Cryptosporidium subtypes with epidemiological relevance) from artifacts resulting from sequencing errors including replication slippage during PCR amplification or contamination.
Previous studies have evaluated the suitability of gp60 metabarcoding [6] using samples naturally infected with Cryptosporidium.However, the lack of appropriate pure culture controls, time-consuming procedures (such as amplicon cloning) on large numbers of isolates, and inaccuracies in laboratory and sequencing processes have hampered efforts to quantify error rates and remove experimental noise from meaningful results.Therefore, directly assessing error rates associated with Cryptosporidium gp60 gene metabarcoding has not previously been undertaken.To address this issue, we designed and synthesized Cryptosporidium gp60 gene fragments that match subtypes from Cryptosporidium hominis (IbA10G2) and Cryptosporidium parvum (IIdA19G1).We then conducted mock community pre-and post-PCR experiments to assess the accuracy of metabarcoding for assessing Cryptosporidium diversity.

METHODS
Synthesized material was created by Genescript, inserted into cloned vectors (pUC57), and extracted to provide a pure culture of each subtype with a total length of 3212 bp for C hominis and 3171 bp for C parvum, due to the different sizes of each species-specific insert.Both insert sequences contained modifications to the original template sequence to distinguish them from potential contaminating Cryptosporidium DNA.The C hominis IbA10G2 insert was 502 bp and contained a repeat region with 13 serines, whereas the one for C parvum IIdA19G1 was 461 bp and had 20 serine repeats.The synthetic sequences were used to create mock communities for testing sequencing error rates as well as variation and biases in PCR.Extracted plasmid samples (see Supplementary Information) were quantified using Qubit 1X dsDNA HS Assay Kit (Thermo Fisher, Waltham, Massachusetts) and diluted to normalized concentrations.Copy number calculations were performed to account for the slightly larger C hominis amplicon and resulted in a known number of gp60 copies added to each PCR reaction (2 × 10 6 ).Nested PCR amplification of extracted products followed previous protocols [5] and used Platinum Taq DNA polymerase (Invitrogen).The internal primers were modified to contain MiSeq adapter sequences on the 5′ end according to standard protocols [8].PCR clean-ups used Ampure beads and ethanol washes.Cleaned samples were quantified using Qubit and normalized to approximately 5 ng/µL.The slight size difference of amplicons (C hominis was 41 bp longer) was accounted for by calculating and equalizing the amplicon copy number such that a known number of gp60 copies was present per microliter of C hominis and C parvum samples.
Sequencing was carried out on an Illumina MiSeq using 500-cycle V2 chemistry, according to the manufacturer's recommendations, producing 2 × 250 base paired-end reads.An Illumina PhiX control library was loaded onto the run at 20% volume, to even out the base composition and prevent biases.All samples have been deposited to the National Center for Biotechnology Information Sequence Read Archive under Bio-project PRJNA1013320 with accession numbers SAMN37295715-SAMN37295752.
Reads were trimmed using DynamicTrim application (http:// solexaqa.sourceforge.net/)with the quality cutoff set at 0.01.To examine variation in the repeat region only, forward reads were trimmed to minimum length of 200 bp and mapped to references based on synthetically designed C hominis and C parvum sequences using Geneious (v.10.2.6) with low sensitivity.The observed proportion of reads belonging each species were compared to the expected ratios and tested using goodness-of-fit test with 99% confidence level.The contig(s) were then further trimmed to only include the serine repeat regions, then the length distribution of all reads was recorded.
To compare with previous studies, we also analyzed the data using amplicon approaches on the paired reads.The Illumina sequence reads for the 36 samples involved in this study were analyzed using R (v.4.2.2) and dada2 (v.1.26.0) [9], tidyverse (v.1.3.2),ggplot2 (v.3.4.2), phyloseq (v.1.42.0), and ShortRead (v.1.56.1) packages.A default dada2 pipeline approach was taken to filter and trim the forward and reverse sequence reads, de-replicate them, calculate and plot error rates, merge paired reads and construct a sequence table, and remove chimeras.We initially ran dada2 with default settings and then with the BANDSIZE value set to "2," whereas the default value is "16." The change in the BANDSIZE was chosen to be the largest value less than the triplet size, so that individual triplets in the repeat region resulted in discrete amplicons being kept within the analysis.Code is available on Github (https://github.com/pjbiggs/CryptosporidiumSlippage).In addition, the analysis was performed with the 20 most abundant amplicons, which accounted for 99.7% of the sequence reads in the study.

RESULTS
Average read counts for each sample type are recorded in Table 1.Cross-contamination rates between species were <0.01%.Post-PCR combinations of C hominis IbA10G2 and C parvum IIdA19G1 templates matched the expected proportions after classification with map to reference analyses using Geneious.In contrast, the percent read counts for the mixed community amplifications (pre-PCR) did not match the template percentages in samples that were unbalanced-that is, the 75:25 and 99:1 combinations.In lower template concentrations, both C hominis and C parvum templates were overrepresented by between 18.6% and 33.8% (P values <.001 from χ 2 test; Figure 1, Table 1).Serine repeat regions varied in length in both templates and consistently across all sample types in the Geneious analysis.In the C hominis template, 95.5% of reads were of the expected size, with 3.8% being 1 serine (3 bases) shorter and 0.7% being 1 serine longer.The same pattern was seen in C parvum, only more pronounced, with 80% of reads of the expected size, 16.6% being 1 serine (3 bases) shorter, 1.8% being 2 serines (6 bases) shorter, and 1.6% being 1 serine longer (Supplementary Figure 1).The dada2 pipeline approach generated the same results as above, with higher rates of replication slippage in the C parvum amplicon and overrepresentation of lower concentration templates in the pre-PCR samples (Figure 1).

DISCUSSION
Our results demonstrate that contamination from different subtype families was absent in the metabarcoding and that PCR and NGS recover the sequences inputted, but we also detected a limited but detectable amount of PCR slippage.The use of a synthetically designed template allowed a direct measurement of polymerase slippage artifacts in the serine repeat region.This region is used for classifying Cryptosporidium subtypes, depending on the number and variety of serine repeats.We observed slippage in both templates, with a consistently higher rate in the C parvum amplicon.Since our template samples had a different number of repeats (13 in C hominis and 20 in C parvum), this corroborates observations in previous studies where slippage frequency is directly increased with the length of the serine repeat region [10].Our study used Platinum Taq DNA polymerase, which has 1X fidelity versus Taq; however, future studies could test results using higher-fidelity enzymes.Nonetheless, estimations of the expected rates of error in the serine repeat region can be extrapolated from our data, are in agreement with previously observed patterns for slippage rates in trinucleotide repeat regions [11], and may be further refined with future research using different-sized repeat regions relevant to different Cryptosporidium species and subtypes.
As anticipated, the post-PCR combinations of C hominis and C parvum PCR products closely matched the expected ratios.However, we detected a marked bias toward rare taxa in the pre-PCR communities.For example, in the 99:1 scenario, the observed number of reads was approximately 70:30.A bias toward rare or low copy number taxa has been observed previously in nested PCR studies when more cycles were applied in the first round of PCR on relatively highly diverse communities [12].However, in more complex assemblages these patterns are less predictable [13,14] and may also vary in real samples with much lower amounts of Cryptosporidium DNA template as well as PCR inhibitors.Amplicon slippage rates did not differ under any of the different treatments, including those with lower initial template concentrations.
Our results were similar using dada2 and sequence analyses by Geneious, but it is important to note that under the default dada2 settings, many of the variants present in the samples were forced to cluster together around the most common fragment length(s).The degree of clustering was dependent on the variation in the community, with lower diversity-that is, 1 subtype more prone to clustering effects.The appropriate BANDSIZE settings must be used, based on expected error rates given the size of the repeat region.This will help past and future studies separate real, biological variation inside the host from artifacts of the PCR and sequencing process.
It is possible that the replication slippage rates observed in the study are not attributable to PCR error alone.Since the synthetically generated templates used in the study were inserted into the pUC57 plasmid, slippage DNA replication as part of normal cloning might have contributed to the observed results as well as during PCR.However, replication slippage during cloning is expected to be low and previous observations using Cryptosporidium templates suggest low error rates in templates with clonal plasmid vectors [10].
While our analyses were conducted on synthetic DNA, it is important to note that slippage is a natural phenomenon.This process holds the potential to foster repeat variations in specific sites, thereby aiding Cryptosporidium in evading immune responses triggered by prior infections.Despite recent research demonstrating that the repeat regions in Cryptosporidium gp60, on which our study is based, do not appear to impact antibody recognition e146 • JID 2024:230 (15 July) • BRIEF REPORT [15], our research has implications for studies assessing the diversity of PCR-amplified repeat regions.
Our study shows that PCR slippage during amplification leads to significant error rates in metabarcoding sequencing of amplicons with repeat regions.The rate of error appears proportional with size of the repeat region.In addition, our study shows a tendency for overrepresentation of rare taxa in mixed assemblages.Contamination from different subtype families was absent.Taken together, our results indicate that many subtypes seen in metabarcoding studies are likely artifacts resulting from PCR slippage during amplification and may be distributed in predictable ways, enabling better interpretation of results.Mixtures of different subtype families and species in Cryptosporidium metabarcoding studies do likely therefore represent real within-sample variation.

e144•
JID 2024:230 (15 July) • BRIEF REPORT The Journal of Infectious Diseases B R I E F R E P O R T

Figure 1 .
Figure 1.Amplicon length proportions for combinations of species (on right of charts where ratios refer to different combinations of Cryptosporidium hominis and Cryptosporidium parvum template).Post-and pre-polymerase chain reaction (PCR) treatments refer to when the C hominis and C parvum template combinations occurred.The expected trimmed amplicon length was 391 and 349 bp for C hominis and C parvum, respectively.

Table 1 . Polymerase Chain Reaction Experimental Design and Read Count Results (Average of 3 Replicates) From Map to Reference Analyses of Forward Reads
Abbreviations: C hominis, Cryptosporidium hominis; C parvum, Cryptosporidium parvum; N/A, not applicable; PCR, polymerase chain reaction.a Samples were derived from amplicons generated from samples 1-6 and combined before library preparation.b Indicates P < .001from χ 2 goodness-of-fit test using expected and observed percent read counts.