Abstract

Background

Cryptosporidium parvum is an apicomplexan parasite commonly found across many host species with a global infection prevalence in human populations of 7.6%. Understanding its diversity and genomic makeup can help in fighting established infections and preventing further transmission. The basis of every genomic study is a high-quality reference genome that has continuity and completeness, thus enabling comprehensive comparative studies.

Findings

Here, we provide a highly accurate and complete reference genome of Cryptosporidium parvum. The assembly is based on Oxford Nanopore reads and was improved using Illumina reads for error correction. We also outline how to evaluate and choose from different assembly methods based on 2 main approaches that can be applied to other Cryptosporidium species. The assembly encompasses 8 chromosomes and includes 13 telomeres that were resolved. Overall, the assembly shows a high completion rate with 98.4% single-copy BUSCO genes.

Conclusions

This high-quality reference genome of a zoonotic IIaA17G2R1 C. parvum subtype isolate provides the basis for subsequent comparative genomic studies across the Cryptosporidium clade. This will enable an improved understanding of diversity and support functional and association studies.

Introduction

Cryptosporidium is an apicomplexan parasite of public health and veterinary significance with a recent analysis reporting a global infection prevalence of 7.6% [1]. Historically, limited government and private funding was available to study the epidemiology and molecular dynamics of the organism, but this has recently shifted [2].

Cryptosporidium spp. have been found in 155 species of mammals, including primates [3,4]. Among humans, 20 species of Cryptosporidium spp. have been identified [5]. Although the parasite can be transmitted in a variety of ways, the most common method is via drinking and recreational waters. In the United States, Cryptosporidium is the most common cause of waterborne disease in humans [6]. Studies have shown that Cryptosporidium is responsible for a large proportion of all cases of moderate-to-severe diarrhea in children younger than 2 years [7,8]. There is currently no vaccine available, and the only approved drug for the treatment of Cryptosporidium-related diarrhea is nitazoxanide, which has limited activity in immunocompromised patients.

Previously, the inability to complete the life cycle of Cryptosporidium in vitro hampered progress in understanding pathogenesis and exploring new treatment modalities. Recent advances using human organoids support the full parasite life cycle, recapitulate in vivo physiology of host tissues [9–12], and provide a way to study the molecular mechanisms and pathways used by Cryptosporidium during infection. However, to facilitate genomic or association studies, a high-quality reference genome is needed.

Cryptosporidium parvum (NCBI:txid5807) was included in early genome-sequencing projects owing to its public health importance and high global prevalence. The first reported complete genome assembly for C. parvum Iowa II became available in 2004 [13]. It was generated by a random shotgun sequencing approach, resulting in ∼13× genome coverage totaling 9.1 Mb of DNA sequence across all 8 chromosomes. This reference sequence had reduced coverage across the genome, with multiple gaps, and was not adequate to represent the full breadth of genes present, which could result in misleading interpretations of the isolates being studied. In addition, online repositories such as GenBank, CryptoDB, and the Wellcome Trust Sanger Institute FTP servers provide a range of unassembled, unprocessed raw read sequences.

Long-read sequencing technology has advanced to enable read lengths of 15–20 kb (Pacific Biosciences) and 2–3 Mb (Oxford Nanopore Technologies [ONT]) with low error rates and is frequently used to improve reference genome assembly [5,14–19], thus enabling long continuous assemblies without gaps even across highly repetitive regions [20]. While long-read technologies enable an improved assembly, it is difficult to evaluate which de novo assembly best represents the sample. Currently, the simplest way to rank de novo assemblies is by length [20] (N50) or BUSCO (BUSCO, RRID:SCR_015008) [21] comparison. However, this is not a guarantee that chromosomes are well represented or correctly arranged. Furthermore, the variety of de novo assembly methods (e.g., Canu [Canu, RRID:SCR_015880] [22], Flye [Flye, RRID:SCR_017016] [23], Shasta [24], Falcon [25]) makes it harder to choose the best representation.

In the present study, we have generated a reference genome for C. parvum by using long-read sequencing on the ONT PromethION (PromethION, RRID:SCR_017987) supplemented with short-read data generated on NovaSeq 6000 (Illumina NovaSeq 6000 Sequencing System, RRID:SCR_016387) for error correction (see Fig. 1). This resulted in a complete reference including all chromosomes and thus represents a gapless representation of this important pathogen. Furthermore, it includes 13 of 16 telomeric sequences. The assembly is available at PRJNA744539 (GCA_019844115.1). In addition to the novel assembly, we lay out our quality control process and assessment of the assembly not only to optimize for length but also to assess the overall structure of the draft assemblies. Following this comparison schema, it is straightforward to choose the best representation. In addition, this schema is applicable to other species as well, from simple haploid organisms to more complex organisms such as plants or humans.

Figure 1: Workflow for the generation of Cryptosporidium parvum assembly.

Results

We sequenced the C. parvum genome with ONT long reads (see Methods) and obtained a total of ∼480 Mb of sequence (Fig. 1). This is equivalent to 53× coverage for this genome (∼9 Mb genome size). Figure 2 shows overall statistics on read length and coverage. The N50 read length is 15.3 kb with 10× coverage of reads with ≥30 kb length. Our longest read detected was 808 kb. In addition, we sequenced the genome using the Illumina NovaSeq 6000 to produce 352× coverage of 150-bp paired-end reads.
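The coverage figures quoted above follow from simple arithmetic: sequenced bases divided by genome size. The minimal Python sketch below reproduces the ∼53× value from the numbers in the text and shows how coverage restricted to long reads could be computed. The function names and the toy read lengths are illustrative assumptions, not part of the original analysis.

```python
def coverage(total_bases, genome_size):
    """Mean depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def coverage_at_min_length(read_lengths, genome_size, min_length):
    """Coverage contributed only by reads of at least `min_length` bases."""
    return sum(n for n in read_lengths if n >= min_length) / genome_size


# Values taken from the text: ~480 Mb of ONT sequence, ~9 Mb genome.
print(round(coverage(480_000_000, 9_000_000)))  # 53

# Toy read lengths (hypothetical) on a hypothetical 10-kb genome.
toy_reads = [40_000, 30_000, 20_000, 10_000]
print(coverage_at_min_length(toy_reads, 10_000, 30_000))  # 7.0
```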

Figure 2: Read length distribution and cumulative coverage over the Oxford Nanopore sequencing. We obtained a total of 53× coverage with long reads and even 10× coverage with reads larger than 30 kb (x axis). The longest read measured was 808 kb.

Using these short reads, we ran GenomeScope (GenomeScope, RRID:SCR_017014) [26] to obtain a genome size estimate, assuming a ploidy of 1. Doing so resulted in an estimate of 9.9 Mb with an 89.24% model fit (see Supplementary Fig. S1). Inspection of the resulting data (Supplementary Fig. S1) highlights that this is a potential overestimation of the genome size itself and thus fits in the realm of the previously reported reference assembly in CryptoDB (GCA_015245375) of ∼9.1 Mb.

Assembly and comparison of Cryptosporidium assembly

The initial assembly was carried out with only the ONT reads using Canu [22] (see Methods) and resulted in 25 contigs, of which eight represent all chromosomes. We obtained a total genome length of 9.19 Mb across the eight assembled contigs with an N50 of 1.11 Mb (Table 1). The largest contig was 1.4 Mb. Our assembly shows an NG50 similar to that of the assembly published in 2004 (see Fig. 3A).
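For readers unfamiliar with the contiguity statistics used here, the following Python sketch shows one common way to compute Nx/Lx (e.g., N50, N90, L50) and, when a reference genome size is supplied, NGx (e.g., NG50). It is a minimal illustration under stated assumptions: the function name is hypothetical and the contig lengths are toy values, not those of the assemblies compared in this study.

```python
def nx_stats(contig_lengths, fraction=0.5, reference_size=None):
    """Return (Nx, Lx) for the given fraction.

    With reference_size=None the target is a fraction of the assembly
    length (N50 for fraction=0.5). With a reference size, the target is a
    fraction of that size instead (NG50). Returns None when the contigs
    never reach the target, which can happen for NGx.
    """
    basis = reference_size if reference_size is not None else sum(contig_lengths)
    target = fraction * basis
    running = 0
    for rank, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, rank
    return None


toy_contigs = [5, 4, 3, 2, 1]  # hypothetical lengths
print(nx_stats(toy_contigs))                     # (4, 2): N50 = 4, L50 = 2
print(nx_stats(toy_contigs, fraction=0.9))       # (2, 4): N90 = 2, L90 = 4
print(nx_stats(toy_contigs, reference_size=30))  # (1, 5): NG50 against a size of 30
```

Because NG50 is normalized by a fixed genome size rather than by each assembly's own length, it is the fairer statistic when two assemblies differ in total length.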

Figure 3: Assembly comparisons. (A) The Canu assembly shows a high concordance with the previously published C. parvum assembly (GCA_015245375.1) [27] (dot plots) and agreements in length (bottom). Nevertheless, clear assembly differences are visible when comparing it to GCA_000165345.1 [13]. (B) The Flye assembly vs the C. parvum assembly (GCA_015245375.1) shows large disagreements. Contig 3 is merged between 2 different Cryptosporidium chromosomes, and 1 chromosome is missing. Also, the length comparison (bottom) shows discrepancies in the beginning, highlighting a very short contig in the end (green track). Interestingly, GCA_000165345.1 shows structural differences over both assemblies, likely indicating errors in the previous reference.

Table 1: Overall assembly statistics and comparison using Quast (QUAST, RRID:SCR_001228) between the present assembly and the previously established assembly

| Statistic | GCA_000165345.1 | GCA_019844115.1 (newly established) |
| --- | --- | --- |
| Total sequence length | 9,102,324 | 9,197,619 |
| Total ungapped length | 9,087,655 | 9,197,619 |
| Unresolved sequences | 14,669 | 0 |
| N50 | 1,104,417 | 1,108,772 |
| N90 | 985,969 | 993,129 |
| L50 | 4 | 4 |
| Total No. of chromosomes | 8 | 8 |

We also generated an assembly with Flye assembler [23] (see Methods), which led to a total of seven contigs. However, one contig was only 62,160 bp long (see Fig. 3B). Despite this early warning sign, we compared the two assemblies to identify which one best represented the C. parvum genome using genome alignments and remapping of short reads.

To validate our findings, we first aligned the Canu and Flye assemblies to the previously published C. parvum genome reference [3] using nucmer [28] (v3.23). The nucmer alignments were filtered by “-l 100 -c 500 -maxmatch” for all assemblies following the suggestions from Assemblytics [29], which was used to study the alignment results that were generated (Fig. 3).

The dot plot from a MUMmer (MUMmer, RRID:SCR_018171) alignment analysis indicates that the GCA_015245375.1 [27] and Canu genome assemblies are largely collinear (Fig. 3A). All chromosomes show collinearity with the previously established assembly for C. parvum. Upon closer inspection, small segments that aligned to other chromosomes were shown to be telomeric sequences. Thus, these segments did not indicate inaccurate alignments per se but highlighted their repetitive nature (see below for details on telomere reconstruction). However, when assessing the dot plot generated for the Flye-assembled genome (Fig. 3B), we observed larger disagreements compared to GCA_015245375.1. As previously mentioned, one contig from the Flye assembly was small (62 kb) and judged to be an artifact. More problematic, however, was the merger of two Cryptosporidium chromosomes into contig 3 (Fig. 3B, second to last row in dot plot). A fusion of two chromosomes from Cryptosporidium was also observed on contig 7. Overall, these analyses show that the Flye assembly yielded seven contigs instead of the expected eight, one of which (∼62 kb) was too small to represent a chromosome. Thus, the two missing chromosomes were merged with other chromosomes within two contigs from Flye. When comparing both of our assemblies (Canu and Flye) to the previously established GCA_000165345.1, we saw large structural disagreements in both assembly comparisons (Fig. 3). The differences between GCA_000165345.1 and our de novo assemblies are most likely due to structural faults in GCA_000165345.1.
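The kind of chromosome merger seen in the dot plots can also be flagged programmatically from alignment summaries. The sketch below is a hedged illustration, not the procedure used in this study: it assumes alignments have already been reduced to (contig, reference chromosome, aligned bases) tuples, for example from a coordinate table exported after nucmer, and it uses a hypothetical 20% threshold so that short repetitive hits such as telomeric repeats are not mistaken for mergers. All names and values are illustrative.

```python
from collections import defaultdict


def reference_share(alignments):
    """Fraction of each contig's aligned bases that falls on each reference chromosome."""
    totals = defaultdict(lambda: defaultdict(int))
    for contig, ref, bases in alignments:
        totals[contig][ref] += bases
    shares = {}
    for contig, per_ref in totals.items():
        contig_total = sum(per_ref.values())
        shares[contig] = {ref: b / contig_total for ref, b in per_ref.items()}
    return shares


def flag_merged(alignments, min_share=0.2):
    """Contigs where two or more reference chromosomes each hold >= min_share."""
    flagged = []
    for contig, per_ref in reference_share(alignments).items():
        major = sorted(ref for ref, share in per_ref.items() if share >= min_share)
        if len(major) >= 2:
            flagged.append((contig, major))
    return sorted(flagged)


# Toy input (hypothetical numbers): contig_1 has one small off-target hit,
# contig_3 is split roughly evenly between two reference chromosomes.
toy = [
    ("contig_1", "chr1", 900_000),
    ("contig_1", "chr5", 5_000),
    ("contig_3", "chr2", 950_000),
    ("contig_3", "chr4", 1_050_000),
]
print(flag_merged(toy))  # [('contig_3', ['chr2', 'chr4'])]
```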

We further carried out a remapping experiment to identify structural disagreements between the Illumina data (short-read) and the long-read assemblies. We mapped the reads and found structural variants (SVs) based on discordant paired-end reads (see Methods) [30]. We identified a total of ten potential SVs over the remapping based on the Flye assembly. The majority of events were insertions (4) followed by duplications (3) and breakend (BND) (2). However, on closer inspection only two SVs (the two BND) showed a misassembly with a homozygous alternative genotype. All other eight SVs showed a minor allele frequency and are likely consequences of mapping artifacts or heterogeneity of the sequenced population. Next, we assessed the Canu assembly, which showed 9 SVs in total. All of the identified SVs showed a low read support, indicating a low probability of being correctly identified and likely originating from mapping artifacts as the material originates from a pure oocyst (see Methods). This assessment demonstrated that the Canu assembly is the better representation of C. parvum compared with the Flye assembly for this study.
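The distinction drawn above between well-supported SVs and low-support calls can be expressed as the fraction of reads supporting the alternative allele. The following Python sketch illustrates the idea on VCF records. It assumes that the sample column carries PR (paired-read) and SR (split-read) fields formatted as "ref,alt" counts, as Manta typically reports them; the 0.8 threshold, the function names, and the two example records are hypothetical and were not taken from the study's data.

```python
def alt_support(format_keys, sample):
    """Return (alt_reads, total_reads) summed over PR and SR when present."""
    fields = dict(zip(format_keys.split(":"), sample.split(":")))
    ref_total = alt_total = 0
    for key in ("PR", "SR"):
        if key in fields:
            ref, alt = (int(x) for x in fields[key].split(","))
            ref_total += ref
            alt_total += alt
    return alt_total, ref_total + alt_total


def classify(vcf_line, min_fraction=0.8):
    """Return (SVTYPE, alt fraction, label) for one single-sample VCF record."""
    cols = vcf_line.rstrip("\n").split("\t")
    info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
    alt, total = alt_support(cols[8], cols[9])
    fraction = alt / total if total else 0.0
    label = "likely misassembly" if fraction >= min_fraction else "low support"
    return info.get("SVTYPE", "NA"), round(fraction, 2), label


# Two made-up records for illustration only.
bnd = "\t".join(["contig_3", "512000", "BND_0", "N", "N[contig_7:1000[", "999",
                 "PASS", "SVTYPE=BND", "GT:FT:GQ:PL:PR:SR",
                 "1/1:PASS:60:999,80,0:1,40:0,35"])
ins = "\t".join(["contig_1", "100", "INS_1", "A", "<INS>", "50",
                 "PASS", "END=100;SVTYPE=INS", "GT:FT:GQ:PL:PR",
                 "0/1:PASS:20:50,0,200:90,10"])
print(classify(bnd))  # ('BND', 0.99, 'likely misassembly')
print(classify(ins))  # ('INS', 0.1, 'low support')
```

In a haploid, clonal sample, a true misassembly should be supported by nearly all reads spanning the breakpoint, which is why a high alternative fraction is used as the criterion in this sketch.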

Establishing Cryptosporidium assembly

The quality of the Canu-generated draft assembly was further improved by 2 rounds of assembly polishing using the short reads (see Methods). After the first round of polishing, the number of corrections was reduced to ∼20 along the entire genome. The 8 largest contigs in the final polished assembly were aligned (see Methods) to the previously published C. parvum reference GCA_015245375.1 [13]. The alignment analysis further confirmed that the 8 contigs represent the previously published chromosomes, while the other contigs appear to be repeats at the start or end of the contigs. Our assembled 8 chromosomes complete 14,669 bp of unresolved sequences (i.e., N). Our assembly also showed a GC content (30.11%) similar to the previous version (30.18%), again attesting to the overall quality.

To further assess the completeness of our assembly, we used BUSCO [21] with the coccidia_odb10 lineage set (see Methods). This analysis confirmed the high quality of our assembly, showing 494 (98.4%) complete re-identified genes from a total of 502. All 494 genes were present as single copies, indicating that the new assembly contains no spurious duplications of these genes. In addition to these single-copy genes, 3 genes were fragmented and 5 genes were missing from the BUSCO run.
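The percentages above follow directly from the gene counts. As a minimal check, the Python sketch below recomputes them from the counts given in the text and prints them in the compact notation BUSCO uses in its summaries. The function name is an illustrative assumption.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Format BUSCO category counts as percentages of the total gene set."""
    total = single + duplicated + fragmented + missing

    def pct(n):
        return round(100 * n / total, 1)

    complete = single + duplicated
    return (f"C:{pct(complete)}%[S:{pct(single)}%,D:{pct(duplicated)}%],"
            f"F:{pct(fragmented)}%,M:{pct(missing)}%,n:{total}")


# Counts from the text: 494 single-copy, 0 duplicated, 3 fragmented, 5 missing.
print(busco_summary(494, 0, 3, 5))
# C:98.4%[S:98.4%,D:0.0%],F:0.6%,M:1.0%,n:502
```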

A further comparison with the previous reference genome (GCA_015245375.1) [13] revealed a high consistency, with only 4 SVs (1 insertion, 1 deletion, 1 tandem expansion, and 1 tandem contraction) between the two assemblies. This comparison was performed on the basis of the genomic alignment and using Assemblytics [29].

Last, we used the Illumina data set to identify single-nucleotide variants (SNVs) with respect to the new assembly (GCA_019844115.1). Supplementary Fig. S2 shows the allele frequency of the passing SNVs (see Methods). It indicates that there are no major differences and highlights the purity of the material used for the assembly process.

Telomere identification

Telomeric ends present on either end of each chromosome were identified in the Canu genome assembly (see Methods). To search for telomeres, we identified matching sequences of “TTTAGG” repeats [31] in our assemblies (see Methods). Telomeric areas were defined as those with ≥100 repeated sequence matches within a region near the start and end of the contigs. Given these conservative thresholds, we identified a total of 13 telomeric regions. For the majority of chromosomes (2, 3, 4, 5, and 6) telomeric regions were identified at both ends of the chromosomes, thus fully representing the chromosomes from telomere to telomere, including the centromere. Telomeres were observed only at the beginning of chromosomes 7 and 1 and at the end of chromosome 8. We further cross-checked the other contigs that were previously filtered out. These highlighted telomeric sequences but could not be placed automatically to the other chromosomes (i.e., chromosomes 1, 7, or 8). Overall, the identification of the telomeric sequences on almost all of the contigs highlights the overall high quality and continuity of our newly established C. parvum genome. The final assembled genome has been deposited at GenBank (accession GCA_019844115.1).

Assessment of subtyping loci

Cryptosporidium spp. are usually typed and characterized widely by using a small set of genetic markers including gp60, COWP, HSP70, and 18S [32]. Most of the genetic marker data available in GenBank were generated from short-read amplification and sequencing by Sanger, thus providing an improved resolution, but still contain errors arising from manual curation.

The gp60 sequence from the present assembly was aligned with reference sequences retrieved from GenBank. Reference sequences selected for alignment consisted of multiple IIa (C. parvum) subtypes, including a IIaA17G2R1 reference (MK165989) corresponding to the sequenced C. parvum isolate in our study. ClustalW alignment was carried out using BioEdit V7.2.5 (BioEdit, RRID:SCR_007361) with no gaps or large mismatches. The assembled genome has 100% identity with the reference genome IIaA17G2R1, and the genetic markers were observed (see Supplementary Fig. S3).

Conclusion

The present work highlights how next-generation sequencing, including third-generation long-read sequencing, can be used to generate a high-quality genome assembly complete with centromeric regions and numerous telomeres. The genome assembly generated provides a gapless reference compared to the previously published GCA_000165345.1 [13] and extends into some telomeric regions over GCA_015245375.1 [27]. Telomeric regions added to those from GCA_000165345.1, which is a hybrid assembly based on two different subtypes of Cryptosporidium spp. (IIaA17G2R1 and IIaA15G2R1), which might affect further comparison or association studies. In contrast, our study was able to boost the fidelity and robustness of the assembly by focusing on one subtype only, IIaA17G2R1, resulting in a better telomere-to-telomere assembly representation (GCA_019844115.1). Studies of Cryptosporidium spp. are based on genetic markers previously identified for some regions of chromosome 6 and are not able to provide a better understanding of the genetic variation and recombination occurring within the species. Thus, establishing stronger marker genes and perhaps enabling improved recovery of Cryptosporidium-specific sequencing reads by mapping to a high-resolution reference genome will enable better understanding of Cryptosporidium transmission.

A commonly used approach for C. parvum subtyping is based on tandem repeat analysis of gp60, a highly polymorphic gene that encodes for an immunodominant glycoprotein (15/40 kDa) located on the surface of sporozoites and merozoites of many Cryptosporidium species [33]. The present study was done using an isolate propagated in calves by Bunch Grass Farms (Deary, ID, USA). The vendor originally propagated C. parvum IOWA II belonging to subtype IIaA15G2R1 based on gp60 sequencing. This strain has now been replaced with a closely related local isolate belonging to the IIaA17G2R1 subtype. In our work, this isolate is referred to as C. parvum (GCA_019844115.1). It is unclear whether the IIaA17G2R1 evolved from IOWA II, possibly from recombination with another local isolate, or whether it represents a distinct isolate on its own. To our knowledge the assembly done here represents the first IIaA17G2R1 subtype isolate for which long-read sequencing has been performed. C. parvum isolates belonging to the IIaA17G2R1 subtype have been identified in farms in various regions of the world [34–36], were the second most common genotype identified in human cases in a recent study done in Canada [37], and are responsible for causing foodborne outbreaks in the USA [38,39].

Published studies have shown the presence of contingency genes in Cryptosporidium spp., which are responsible for surmounting challenges from the host and are subject to spontaneous mutation rates [40–42]. The majority of these genes are located in the telomere regions of the chromosomes, which are prime sites that evolve and mediate host-parasite interactions [31,43]. In the present assembly, we were able to resolve 13 of the possible 16 telomeres. The capacity to resolve telomeres and subtelomeres across chromosomes in Cryptosporidium spp. will lead to a better understanding of the organism's adaptation to a variety of environmental and host settings.

We utilized two de novo assembly approaches here to obtain a better representation for Cryptosporidium spp. and demonstrated two methods for validating these two assemblies. First, we compared the assemblies from Flye and Canu to pre-existing assemblies from Cryptosporidium spp. from different subtypes and were able to identify certain structural differences. Furthermore, the detection of SVs proved helpful in deciding which assembly best represents the species at hand [20]. This was only possible by having orthogonal sequenced Illumina reads. Other studies might choose a different strategy such as utilizing HiC directly, which would also enable a better scaffolding [44]. For Cryptosporidium spp. this was not necessary because the genome is of relatively small size (∼9 Mb) and encompasses eight chromosomes. The analysis of BUSCO is also an important indication of quality (i.e., completeness and redundancy) but did not indicate incorrect rearrangements identified with the Flye assembly. These types of misassemblies can be readily identified only by comparing closely related reference genomes and/or orthologous data sets (e.g., Illumina short reads).

The final Cryptosporidium spp. assembly will be a helpful resource to advance the study of this important pathogen, further investigate its complexity during growth and development in vitro, and serve as a reference for the study of genetic diversity among different isolates. Furthermore, we hope that it also facilitates translational research that focuses on characterizing virulence, pathogenicity, and host specificity. In this way, new targets may be found leading to vaccines or effective antiparasitic agents to treat this important pathogen.

Methods

DNA extraction

Cryptosporidium parvum oocysts were obtained from Bunchgrass Farm in Deary, ID (Lot No. 22–20, shed date, 10/2/20), and are propagated from IOWA-1 subtype IIaA15G2R1, which was recently replaced by a local isolate subtype IIaA17G2R1 [45]. Purified oocysts (10⁸) were washed in PBS and treated with diluted bleach for 10 minutes on ice to allow for sporozoite excystation. Parasites were pelleted, washed in PBS, and DNA was extracted using Ultrapure™ phenol:chloroform:isoamyl alcohol (Thermo Fisher Scientific, Waltham, MA, USA) followed by ethanol precipitation. Glycoblue™ co-precipitant (Thermo Fisher Scientific, Waltham, MA, USA) was used to facilitate visualization of DNA during extraction and purification steps.

ONT Library preparation and sequencing

NEBNext FFPE DNA Repair Mix (New England Biolabs, Ipswich, MA, USA) was used to repair 620 ng of genomic DNA, which was then followed by end-repair and dA-tailing with NEBNext Ultra II reagents. The dA-tailed insert molecules were further ligated with an ONT adaptor via ligation kit SQK-LSK110 (Oxford Nanopore Technologies, UK). Purification of the library was carried out with AMPure XP beads (Beckman, Cat No. A63880), the final library of 281 ng was loaded to 1 PromethION 24 flow cell (FLO-PRO002, Oxford Nanopore Technologies, UK), and the sequencing data were collected for 24 hours.

Illumina Library preparation and sequencing

DNA (100 ng) was sheared into fragments of ∼300–400 bp in a Covaris E210 system (96-well format, Covaris, Inc., Woburn, MA, USA) followed by purification of the fragmented DNA using AMPure XP beads (Beckman Coulter, Inc., USA, Cat# A63880). DNA end repair, 3′-adenylation, ligation to Illumina multiplexing dual-index adaptors, and ligation-mediated PCR (LM-PCR) were all completed using automated processes. The KAPA HiFi polymerase (KAPA Biosystems Inc., Boston, MA, USA) was used for PCR amplification (10 cycles), which is known to amplify high-GC and low-AT rich regions at greater efficiency. A fragment analyzer (Advanced Analytical Technologies, Inc., Iowa, USA) electrophoresis system was used for library quantification and size estimation. The libraries were 630 bp (including adaptor and barcode), on average. The library was pooled with other internal samples, with adjustment carried out to yield 3 Gb of data on a NovaSeq 6000 S4 flow cell.

Genome size estimation

We used Jellyfish (Jellyfish, RRID:SCR_005491) (version 2.3.0) to generate a k-mer–based histogram of our raw reads to estimate the genome size based on our short-read data. To obtain this we ran Jellyfish [46, 47] with “jellyfish count -C -m 21 -s 1000000000 -t 10” and subsequently the “histo” module with default parameters. The obtained histogram was loaded into GenomeScope [46] with the appropriate parameters (k-mer size of 21 and a haploid genome). GenomeScope provided the overall statistics across the short reads.
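To make the principle behind k-mer–based size estimation concrete, the sketch below applies the textbook approximation (total k-mers divided by the k-mer depth at the main peak) to a histogram in the two-column "depth count" layout. This is a deliberately simplified stand-in and not GenomeScope's model, which fits a mixture distribution. The error cutoff, the function name, and the toy histogram are assumptions for illustration.

```python
def estimate_genome_size(histo_lines, error_cutoff=5):
    """Estimate genome size from a k-mer histogram.

    Each line holds "<depth> <number of distinct k-mers at that depth>".
    K-mers below `error_cutoff` are treated as sequencing errors and ignored.
    Returns (estimated size in k-mers, peak depth).
    """
    pairs = []
    for line in histo_lines:
        depth, n_kmers = (int(x) for x in line.split())
        if depth >= error_cutoff:
            pairs.append((depth, n_kmers))
    peak_depth = max(pairs, key=lambda p: p[1])[0]
    total_kmers = sum(depth * n for depth, n in pairs)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical): an error spike at low depth, a peak at depth 10.
toy_histo = ["1 5000", "2 800", "9 100", "10 300", "11 100"]
print(estimate_genome_size(toy_histo))  # (500.0, 10)
```

A real histogram has a long tail from repeats, and contaminant k-mers can inflate the numerator; a simple estimate such as this one should therefore be read as an upper-leaning approximation.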

Assembly evaluation

We aligned the assembly of Canu (version 2.0) [22] and Flye (version 2.8.1-b1676) [23] with the 2 Cryptosporidium assemblies GCA_000165345.1 and GCA_015245375.1 using nucmer (version 3.1) -maxmatch -l 100 -c 500 [28]. Next, the delta files were evaluated with Assemblytics [29] (version 1.2.1) using the dot plot function. In addition, we mapped the short Illumina reads using bwa mem [48] (0.7.17-r1188) with default parameters to our new assembly. Subsequently, we identified SVs using Manta [49] (v1.6.0) and assessed the VCF file manually. Manta identifies SVs on the basis of abnormally spaced or oriented paired-end Illumina reads here with respect to our new assembly. We further assessed the Illumina data by identifying SNVs using iVar [50] (version 1.3.1) with default parameters. We summarized the allele frequencies across the reads using a custom bash script for PASS variants only.
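The allele-frequency summary was produced with a custom bash script that is not reproduced here. As a rough Python equivalent, the sketch below bins the alternative allele frequency of passing variants from a tab-separated variant table. It assumes column names ALT_FREQ and PASS with PASS recorded as TRUE/FALSE, which matches the usual iVar variant output but should be checked against the actual file; the function name, bin count, and toy table are hypothetical.

```python
import csv
import io
from collections import Counter


def allele_frequency_bins(tsv_text, n_bins=10):
    """Count PASS variants per allele-frequency bin (bin 0 = lowest frequencies)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    bins = Counter()
    for row in reader:
        if row["PASS"] != "TRUE":
            continue
        freq = float(row["ALT_FREQ"])
        # A frequency of exactly 1.0 is placed in the last bin.
        index = min(int(freq * n_bins), n_bins - 1)
        bins[index] += 1
    return [bins.get(i, 0) for i in range(n_bins)]


# Toy table with only the columns used here (values are made up).
toy_tsv = "\n".join([
    "REGION\tPOS\tREF\tALT\tALT_FREQ\tPASS",
    "chr1\t100\tA\tG\t0.03\tTRUE",
    "chr1\t250\tC\tT\t0.05\tTRUE",
    "chr2\t900\tG\tA\t0.98\tTRUE",
    "chr2\t950\tT\tC\t0.50\tFALSE",
    "chr3\t40\tA\tT\t1.0\tTRUE",
])
print(allele_frequency_bins(toy_tsv))  # [2, 0, 0, 0, 0, 0, 0, 0, 0, 2]
```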

Assembly and polishing

We used Canu [22] (v2.0) for the assembly, which was based only on Nanopore pass data and a genome size estimate of 9 Mb. On the Nanopore pass reads, we also ran the assembly using Flye [23] (version 2.8.1-b1676) with the default parameters. Subsequently, we aligned the short reads using bwa-mem (version 0.7.17-r1188) with -M -t 10 parameters. Samtools (SAMTOOLS, RRID:SCR_002105) [51] (v1.9) was used to compress and sort the alignments. The resulting alignment was used by Pilon (Pilon, RRID:SCR_014731) [52] (v 1.24) with the parameter “--fix bases”, correcting one chromosome after another of the raw assembly. This process was repeated 2 times, achieving a high concordance of the reads and the long-read assembly at the second polishing step.
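The polishing loop described above can be laid out as an ordered list of commands. The Python sketch below only builds that list and executes nothing. It is an outline under stated assumptions: file names and the jar path are hypothetical, the whole assembly is polished at once rather than chromosome by chromosome, and only options named in the text (bwa mem -M -t 10, Pilon --fix bases) or needed for input and output are used.

```python
def polishing_plan(assembly, reads_1, reads_2, rounds=2, threads=10,
                   pilon_jar="pilon.jar"):
    """Return ordered (command, stdout_file) steps for short-read polishing.

    stdout_file is None unless the tool writes its result to standard output.
    """
    steps = []
    genome = assembly
    for i in range(1, rounds + 1):
        prefix = f"polish_round{i}"  # hypothetical output naming
        sam, bam = f"{prefix}.sam", f"{prefix}.sorted.bam"
        steps += [
            (["bwa", "index", genome], None),
            (["bwa", "mem", "-M", "-t", str(threads), genome, reads_1, reads_2], sam),
            (["samtools", "sort", "-o", bam, sam], None),
            (["samtools", "index", bam], None),
            (["java", "-jar", pilon_jar, "--genome", genome, "--frags", bam,
              "--fix", "bases", "--output", prefix], None),
        ]
        genome = f"{prefix}.fasta"  # Pilon writes <output prefix>.fasta
    return steps


plan = polishing_plan("canu.fasta", "R1.fastq.gz", "R2.fastq.gz")
for command, stdout_file in plan:
    suffix = f" > {stdout_file}" if stdout_file else ""
    print(" ".join(command) + suffix)
```

Each round realigns the reads to the output of the previous round, so the number of corrections Pilon reports is expected to drop from one round to the next.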

BUSCO assessment

We ran BUSCO [21] (v5.2.2) to assess the completeness of our assembly using the parameters “busco -m geno -l coccidia_odb10 -i” with the coccidia_odb10 lineage data set (creation date: 5 August 2020, No. of genomes: 20, No. of BUSCOs: 502). The summary statistics generated by BUSCO are presented under Results.

Telomere identification

We used the sequence “TTTAGGTTTAGGTTTAGG” to identify telomeric sequences at the start and end of every contig from our assembly. To do so we used Bowtie (Bowtie 2, RRID:SCR_016368) [53] (version 1.2.3) to align the telomeric sequence back to the assembly with -a parameter. Subsequently we counted the matches across regions using a custom script. In short, we used 10-kb windows to count the number of reported hits, align the genome, and compare the locations with the expected start/end locations. The identified regions were filtered for ≥100 hits to guarantee a robust match. This way, we counted the number of times each chromosome was listed.
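The windowed counting step can be illustrated without an aligner. The sketch below swaps the Bowtie alignment for exact string matching of the “TTTAGG” unit (and its reverse complement) in the first and last 10 kb of each contig, and applies the ≥100-hit threshold from the text. It is a simplified stand-in for the custom script, which is not shown in the article; the function names and toy contigs are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_motif(window, motif="TTTAGG"):
    """Occurrences of the motif on either strand within a window."""
    return window.count(motif) + window.count(reverse_complement(motif))


def telomeric_ends(contigs, window=10_000, min_hits=100):
    """Map contig name -> (telomere at start, telomere at end)."""
    result = {}
    for name, seq in contigs.items():
        seq = seq.upper()
        start_hits = count_motif(seq[:window])
        end_hits = count_motif(seq[-window:])
        result[name] = (start_hits >= min_hits, end_hits >= min_hits)
    return result


# Toy contigs (synthetic): chr_a has repeats at both ends, chr_b only at the start.
filler = "ACGT" * 5000
toy_contigs = {
    "chr_a": "TTTAGG" * 120 + filler + "CCTAAA" * 120,
    "chr_b": "CCTAAA" * 150 + filler,
}
print(telomeric_ends(toy_contigs))
# {'chr_a': (True, True), 'chr_b': (True, False)}
```

Exact matching misses repeat units that carry sequencing errors, so an alignment-based search that tolerates mismatches, as used in the study, is the more sensitive choice on real data.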

Regional comparison

Genetic marker gp60 was used to subtype the assembled genome against available GenBank reference genomes for C. parvum. Representative reference genomes for C. parvum were downloaded from GenBank and were aligned using ClustalW (ClustalW, RRID:SCR_017277) [54] (BioEdit V7.2.5) against the present assembly. Further analysis of the gp60 gene sequence for tandem repeats to determine subtype designation was done following the methods of Alves et al. [55].
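The repeat part of a gp60 subtype name can be derived by counting codons in the serine repeat region. The sketch below assumes the commonly described convention in which A counts TCA repeats, G counts TCG repeats, and R counts copies of ACATCA immediately after the serine run; this convention is stated here as an assumption rather than taken from this article, and the subtype family prefix (e.g., IIa) is not determined by the code. The function name and the synthetic test sequence are hypothetical.

```python
import re

# A run of TCA/TCG codons, optionally followed by ACATCA copies.
_REPEAT_REGION = re.compile(r"((?:TCA|TCG)+)((?:ACATCA)*)")


def gp60_repeat_designation(sequence):
    """Return the repeat designation (e.g. 'A17G2R1') of the longest serine run."""
    seq = sequence.upper()
    best = max(_REPEAT_REGION.finditer(seq),
               key=lambda m: len(m.group(1)), default=None)
    if best is None:
        return None
    serine_run, r_run = best.group(1), best.group(2)
    codons = [serine_run[i:i + 3] for i in range(0, len(serine_run), 3)]
    name = f"A{codons.count('TCA')}"
    if codons.count("TCG"):
        name += f"G{codons.count('TCG')}"
    if r_run:
        name += f"R{len(r_run) // 6}"
    return name


# Synthetic sequence built to contain 17 TCA, 2 TCG, and 1 ACATCA.
toy_gp60 = "GATGGT" + "TCA" * 15 + "TCG" + "TCA" * 2 + "TCG" + "ACATCA" + "ACAACT"
print(gp60_repeat_designation(toy_gp60))  # A17G2R1
```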

Data Availability

The genome assembly is available in the NCBI repository and can be accessed with BioProject PRJNA744539 (GCA_019844115.1). All supporting data and materials are available in the GigaScience GigaDB database [56].

Additional Files

Supplemental Figure S1. Genomescope estimation of genome size.

Supplemental Figure S2. ClustalW alignment of the gp60 coding sequence with the assembly.

Abbreviations

bp: base pairs; BUSCO: Benchmarking Universal Single-Copy Orthologs; Gb: gigabase pairs; Mb: megabase pairs; NCBI: National Center for Biotechnology Information; ONT: Oxford Nanopore Technologies; PBS: phosphate-buffered saline; SNV: single-nucleotide variant; SV: structural variant.

Conflict of Interests

F.J.S. has presented at both ONT- and Pacific Biosciences–sponsored conferences. The authors declare that they have no other competing interests.

Funding

This work was supported by the National Institute of Allergy and Infectious Diseases (Grant No. 1U19AI144297).

Authors’ Contributions

F.J.S. and V.K.M.: Conceptualization, Analysis, and Writing—Original Draft Preparation

C.L.C. and G.A.M.: Conceptualization and Writing—Review & Editing

P.C.O.: Conceptualization, Resources, and Writing—Review & Editing

H.D., Q.M., and D.M.M.: Conceptualization and Writing—Review & Editing

S.S., S.B., K.K., G.W., H.S., V.V., and Y.H.: Methodology and Investigation

M.C.R., K.L.H., and S.J.C.: Conceptualization

M.M. and M.M.: Analysis

R.A.G. and J.F.P.: Conceptualization, Funding Acquisition

References

1. Dong S, Yang Y, Wang Y, et al. Prevalence of Cryptosporidium infection in the global population: a systematic review and meta-analysis. Acta Parasitol. 2020;65(4):882–9.

2. Head MG, Brown RJ, Newell M-L, et al. The allocation of US$105 billion in global funding from G20 countries for infectious disease research between 2000 and 2017: a content analysis of investments. Lancet Glob Health. 2020;8(10):e1295–304.

3. Fayer R, Morgan U, Upton SJ. Epidemiology of Cryptosporidium: transmission, detection and identification. Int J Parasitol. 2000;30(12-13):1305–22.

4. Fayer R. Cryptosporidium: a water-borne zoonotic parasite. Vet Parasitol. 2004;126(1-2):37–56.

5. Xiao L, Feng Y. Molecular epidemiologic tools for waterborne pathogens Cryptosporidium spp. and Giardia duodenalis. Food Waterborne Parasitol. 2017;8-9:14–32.

6. Parasites - Cryptosporidium (also known as “Crypto”). https://www.cdc.gov/parasites/crypto/index.html. 2019. Accessed 20 May 2021.

7. Platts-Mills JA, Babji S, Bodhidatta L, et al. Pathogen-specific burdens of community diarrhoea in developing countries: a multisite birth cohort study (MAL-ED). Lancet Glob Health. 2015;3(9):e564–75.

8. Kotloff KL, Nataro JP, Blackwelder WC, et al. Burden and aetiology of diarrhoeal disease in infants and young children in developing countries (the Global Enteric Multicenter Study, GEMS): a prospective, case-control study. Lancet. 2013;382(9888):209–22.

9. Heo I, Dutta D, Schaefer DA, et al. Modelling Cryptosporidium infection in human small intestinal and lung organoids. Nat Microbiol. 2018;3(7):814–23.

10. Cardenas D, Bhalchandra S, Lamisere H, et al. Two- and three-dimensional bioengineered human intestinal tissue models for Cryptosporidium. Methods Mol Biol. 2020;2052:373–402.

11. Vinayak S, Pawlowic MC, Sateriale A, et al. Genetic modification of the diarrhoeal pathogen Cryptosporidium parvum. Nature. 2015;523(7561):477–80.

12. Hoe TW. Exploring the impact of serious games for cognitive functions through the Humphrey Fellowship Programme. Malays J Med Sci. 2018;25(3).

13. Abrahamsen MS, Templeton TJ, Enomoto S, et al. Complete genome sequence of the apicomplexan, Cryptosporidium parvum. Science. 2004;304(5669):441–5.

14. Dong L, Wang X, Guo H, et al. Chromosome-level genome assembly of the endangered humphead wrasse Cheilinus undulatus: insight into the expansion of opsin genes in fishes. Mol Ecol Resour. 2021;21(7):2388–406.

15. Brancaccio RN, Robitaille A, Dutta S, et al. MinION nanopore sequencing and assembly of a complete human papillomavirus genome. J Virol Methods. 2021;294:114180.

16. Espiritu HM, Mamuad LL, Jin S-J, et al. High quality genome sequence of Treponema phagedenis KS1 isolated from bovine digital dermatitis. J Anim Sci Technol. 2020;62(6):948–51.

17. Cuscó A, Pérez D, Viñes J, et al. Long-read metagenomics retrieves complete single-contig bacterial genomes from canine feces. BMC Genomics. 2021;22(1).

18. Sun F, Sun S, Ye W, et al. Genome sequence data of three formae speciales of Phytophthora vignae causing Phytophthora stem rot on different Vigna species. Plant Dis. 2021;105(11):3732–5.

19. De Coster W, Weissensteiner MH, Sedlazeck FJ. Towards population-scale long-read sequencing. Nat Rev Genet. 2021;22(9):572–87.

20. Sedlazeck FJ, Lee H, Darby CA, et al. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet. 2018;19(6):329–46.

21. Simão FA, Waterhouse RM, Ioannidis P, et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.

22. Koren S, Walenz BP, Berlin K, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27(5):722–36.

23. Kolmogorov M, Yuan J, Lin Y, et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37(5):540–6.

24. Shafin K, Pesout T, Lorig-Roach R, et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol. 2020;38(9):1044–53.

25. Chin C-S, Peluso P, Sedlazeck FJ, et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods. 2016;13(12):1050–4.

26. Vurture GW, Sedlazeck FJ, Nattestad M, et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33(14):2202–4.

27. Baptista RP, Li Y, Sateriale A, et al. Long-read assembly and comparative evidence-based reanalysis of Cryptosporidium genome sequences reveals expanded transporter repertoire and duplication of entire chromosome ends including subtelomeric regions. Genome Res. 2022;32(1):203–13.

28. Kurtz S, Phillippy A, Delcher AL, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12.

29. Nattestad M, Schatz MC. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 2016;32(19):3021–3.

30. Mahmoud M, Gobet N, Cruz-Dávalos DI, et al. Structural variant calling: the long and the short of it. Genome Biol. 2019;20(1).

31. Liu C, Schroeder AA, Kapur V, et al. Telomeric sequences of Cryptosporidium parvum. Mol Biochem Parasitol. 1998;94(2):291–6.

32. Widmer G, Sullivan S. Genomics and population biology of Cryptosporidium species. Parasite Immunol. 2012;34(2-3):61–71.

33. Strong WB, Gut J, Nelson RG. Cloning and sequence analysis of a highly polymorphic Cryptosporidium parvum gene encoding a 60-kilodalton glycoprotein and characterization of its 15- and 45-kilodalton zoite surface antigen products. Infect Immun. 2000;68(7):4117–34.

34. Mi R, Wang X, Huang Y, et al. Prevalence and molecular characterization of Cryptosporidium in goats across four provincial level areas in China. PLoS One. 2014;9(10):e111164.

35. Kaupke A, Rzeżutka A. Emergence of novel subtypes of Cryptosporidium parvum in calves in Poland. Parasitol Res. 2015;114(12):4709–16.

36. Caffarena RD, Meireles MV, Carrasco-Letelier L, et al. Dairy calves in Uruguay are reservoirs of zoonotic subtypes of Cryptosporidium parvum and pose a potential risk of surface water contamination. Front Vet Sci. 2020;7:562.

37. Guy RA, Yanta CA, Muchaal PK, et al. Molecular characterization of Cryptosporidium isolates from humans in Ontario, Canada. Parasit Vectors. 2021;14(1):69.

38. Blackburn BG, Mazurek JM, Hlavsa M, et al. Cryptosporidiosis associated with ozonated apple cider. Emerg Infect Dis. 2006;12(4):684–6.

39. Centers for Disease Control and Prevention (CDC). Cryptosporidiosis outbreak at a summer camp–North Carolina. MMWR Morb Mortal Wkly Rep. 2009;60:918–22.

40. Bouzid M, Tyler KM, Christen R, et al. Multi-locus analysis of human infective Cryptosporidium species and subtypes using ten novel genetic loci. BMC Microbiol. 2010;10(1).

41. Widmer G, Lee Y, Hunt P, et al. Comparative genome analysis of two Cryptosporidium parvum isolates with different host range. Infect Genet Evol. 2012;12(6):1213–21.

42. Moxon ER, Lenski RE, Rainey PB. Adaptive evolution of highly mutable loci in pathogenic bacteria. Perspect Biol Med. 1998;42(1):154–5.

43. Bouzid M, Hunter PR, Chalmers RM, et al. Cryptosporidium pathogenicity and virulence. Clin Microbiol Rev. 2013;26(1):115–34.

44. Kadota M, Nishimura O, Miura H, et al. Multifaceted Hi-C benchmarking: what makes a difference in chromosome-scale genome scaffolding? Gigascience. 2020;9(1).

45. Zhang H, Zhu G. High-throughput screening of drugs against the growth of Cryptosporidium parvum in vitro by qRT-PCR. Methods Mol Biol. 2020;2052:319–34.

46. Ranallo-Benavidez TR, Jaron KS, Schatz MC. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 2020;11(1).

47. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–70.

48. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013; arXiv:1303.3997.

49. Chen X, Schulz-Trieglaff O, Shaw R, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32(8):1220–2.

50. Grubaugh ND, Gangavarapu K, Quick J, et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol. 2019;20(1).

51. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.

52. Walker BJ, Abeel T, Shea T, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One. 2014;9(11):e112963.

53. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.

54. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–80.

55. Alves M, Ribeiro AM, Neto C, et al. Distribution of Cryptosporidium species and subtypes in water samples in Portugal: a preliminary study. J Eukaryot Microbiol. 2006;53(s1):S24–5.

56. Sedlazeck FJ, Menon VK, Okhuysen PC, et al. Supporting data for “Fully resolved assembly of Cryptosporidium parvum.” GigaScience Database. 2022.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.