Vipin K Menon, Pablo C Okhuysen, Cynthia L Chappell, Medhat Mahmoud, Medhat Mahmoud, Qingchang Meng, Harsha Doddapaneni, Vanesa Vee, Yi Han, Sejal Salvi, Sravya Bhamidipati, Kavya Kottapalli, George Weissenberger, Hua Shen, Matthew C Ross, Kristi L Hoffman, Sara Javornik Cregeen, Donna M Muzny, Ginger A Metcalf, Richard A Gibbs, Joseph F Petrosino, Fritz J Sedlazeck, Fully resolved assembly of Cryptosporidium parvum, GigaScience, Volume 11, 2022, giac010, https://doi.org/10.1093/gigascience/giac010
Abstract
Cryptosporidium parvum is an apicomplexan parasite commonly found across many host species with a global infection prevalence in human populations of 7.6%. Understanding its diversity and genomic makeup can help in fighting established infections and prohibiting further transmission. The basis of every genomic study is a high-quality reference genome that has continuity and completeness, thus enabling comprehensive comparative studies.
Here, we provide a highly accurate and complete reference genome of Cryptosporidium parvum. The assembly is based on Oxford Nanopore reads and was improved using Illumina reads for error correction. We also outline how to evaluate and choose from different assembly methods based on 2 main approaches that can be applied to other Cryptosporidium species. The assembly encompasses 8 chromosomes and includes 13 telomeres that were resolved. Overall, the assembly shows a high completion rate with 98.4% single-copy BUSCO genes.
This high-quality reference genome of a zoonotic IIaA17G2R1 C. parvum subtype isolate provides the basis for subsequent comparative genomic studies across the Cryptosporidium clade. This will enable improved understanding of diversity, functional, and association studies.
Introduction
Cryptosporidium is an apicomplexan parasite of public health and veterinary significance with a recent analysis reporting a global infection prevalence of 7.6% [1]. Historically, limited government and private funding was available to study the epidemiology and molecular dynamics of the organism, but this has recently shifted [2].
Cryptosporidium spp. have been found in 155 species of mammals, including primates [3,4]. Among humans, 20 species of Cryptosporidium spp. have been identified [5]. Although the parasite can be transmitted in a variety of ways, the most common method is via drinking and recreational waters. In the United States, Cryptosporidium is the most common cause of waterborne disease in humans [6]. Studies have shown that Cryptosporidium is responsible for a large proportion of all cases of moderate-to-severe diarrhea in children younger than 2 years [7,8]. There is currently no vaccine available, and the only approved drug for the treatment of Cryptosporidium-related diarrhea is nitazoxanide, which has limited activity in immunocompromised patients.
Previously, the inability to complete the life cycle of Cryptosporidium in vitro hampered progress in understanding pathogenesis and exploring new treatment modalities. Recent advances using human organoids support the full parasite life cycle, recapitulate in vivo physiology of host tissues [9–12], and provide a way to study the molecular mechanisms and pathways used by Cryptosporidium during infection. However, to facilitate genomic or association studies, a high-quality reference genome is needed.
Cryptosporidium parvum (NCBI:txid5807) was included in early genome-sequencing projects owing to its public health importance and high global prevalence. The first reported complete genome assembly for C. parvum Iowa II became available in 2004 [13], generated by a random shotgun sequencing approach, resulting in ∼13× genome coverage totaling 9.1 Mb of DNA sequence across all 8 chromosomes. This reference sequence had a reduced coverage across the genome, with multiple gaps, and was not adequate to represent the full breadth of genes present, which could result in misleading interpretations of the isolates being studied. In addition, online repositories such as GenBank, CryptoDB, and the Wellcome Trust Sanger Institute FTP servers provide a range of unassembled, unprocessed raw read sequences.
Long-read sequencing technology has advanced to enable read lengths of 15–20 kb (Pacific Biosciences) and 2–3 Mb (Oxford Nanopore Technologies [ONT]) with low error rates and is frequently used to improve reference genome assembly [5,14–19], thus enabling long continuous assemblies without gaps even across highly repetitive regions [20]. While long-read technologies enable an improved assembly, it is difficult to evaluate which de novo assembly best represents the sample. Currently, the simplest way to rank de novo assemblies is by length [20] (N50) or BUSCO (BUSCO, RRID:SCR_015008) [21] comparison. However, this is not a guarantee that chromosomes are well represented or correctly arranged. Furthermore, the variety of de novo assembly methods (e.g., Canu [Canu, RRID:SCR_015880] [22], Flye [Flye, RRID:SCR_017016] [23], Shasta [24], Falcon [25]) makes it harder to choose the best representation.
In the present study, we have generated a reference genome for C. parvum by using long-read sequencing on the ONT PromethION (PromethION, RRID:SCR_017987) supplemented with short-read data generated on NovaSeq 6000 (Illumina NovaSeq 6000 Sequencing System, RRID:SCR_016387) for error correction (see Fig. 1). This resulted in a complete reference including all chromosomes and thus represents a gapless representation of this important pathogen. Furthermore, it includes 13 of 16 telomeric sequences. The assembly is available at PRJNA744539 (GCA_019844115.1). In addition to the novel assembly, we lay out our quality control process and assessment of the assembly not only to optimize for length but also to assess the overall structure of the draft assemblies. Following this comparison schema, it is easy to choose the most optimal representation. In addition, this schema is applicable for other species as well, from single haploid to more complex organisms such as plants or humans.

Figure 1: Workflow for the generation of the Cryptosporidium parvum assembly.
Results
We sequenced the C. parvum genome with ONT long reads (see Methods) and obtained a total of ∼480 Mb of sequence (Fig. 1). This is equivalent to 53× coverage for this genome (∼9 Mb genome size). Figure 2 shows overall statistics on read length and coverage. The N50 read length is 15.3 kb with 10× coverage of reads with ≥30 kb length. Our longest read detected was 808 kb. In addition, we sequenced the genome using the Illumina NovaSeq 6000 to produce 352× coverage of 150-bp paired-end reads.
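The coverage figures quoted above follow from simple arithmetic: sequenced bases divided by genome size. The sketch below reproduces the ∼53× value from the ∼480 Mb and ∼9 Mb figures in the text; the function names and the toy read lengths are illustrative assumptions, not part of the original analysis.

```python
def fold_coverage(total_bases, genome_size):
    """Average fold coverage = sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def coverage_from_long_reads(read_lengths, min_length, genome_size):
    """Fold coverage contributed only by reads of at least min_length bases."""
    long_bases = sum(n for n in read_lengths if n >= min_length)
    return fold_coverage(long_bases, genome_size)


# Figures quoted in the text: ~480 Mb of ONT sequence, ~9 Mb genome.
ont_coverage = fold_coverage(480e6, 9e6)
print(f"ONT coverage: {ont_coverage:.0f}x")

# Toy read set (hypothetical lengths in bp), not the real data.
toy_reads = [40_000, 35_000, 12_000, 8_000, 5_000]
toy_long_coverage = coverage_from_long_reads(toy_reads, 30_000, 25_000)
print(f"Toy coverage from reads >= 30 kb: {toy_long_coverage:.1f}x")
```

The same second function, applied to the real read-length list with a 30-kb cutoff, would give the kind of "coverage from long reads only" value reported for reads of ≥30 kb.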

Figure 2: Read length distribution and cumulative coverage over the Oxford Nanopore sequencing. We obtained a total of 53× coverage with long reads and even 10× coverage with reads larger than 30 kb (x axis). The longest read measured was 808 kb.
Using these short reads, we estimated the genome size with GenomeScope (GenomeScope, RRID:SCR_017014) [26], assuming a ploidy of 1. This resulted in an estimate of 9.9 Mb with an 89.24% model fit (see Supplementary Fig. S1). Inspection of the resulting data (Supplementary Fig. S1) suggests that this is likely an overestimate of the genome size and is thus compatible with the previously reported reference assembly in CryptoDB (GCA_015245375) of ∼9.1 Mb.
Assembly and comparison of Cryptosporidium assembly
The initial assembly was carried out with only the ONT reads using Canu [22] (see Methods) and resulted in 25 contigs with eight contigs representing all chromosomes. We obtained a total genome length of 9.19 Mb across eight assembled contigs with an N50 size of 1.11 Mb (Table 1). The largest contig was 1.4 Mb. Our assembly shows an NG50 similar to that of the assembly published in 2004 (see Fig. 3A).
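For readers less familiar with the contiguity metrics used here, the following minimal sketch shows how N50, L50, and NG50 are conventionally computed from a list of contig lengths. It is a generic illustration with toy numbers, not the QUAST implementation used for Table 1.

```python
def nx_stats(contig_lengths, fraction=0.5, genome_size=None):
    """Return (Nx, Lx) for the given fraction.

    With genome_size=None the target is a fraction of the assembly length
    (N50/L50 for fraction=0.5). With genome_size set, the target is a
    fraction of that reference size instead (NG50/LG50).
    """
    lengths = sorted(contig_lengths, reverse=True)
    base = genome_size if genome_size is not None else sum(lengths)
    target = fraction * base
    running = 0
    for rank, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            return length, rank
    # The assembly is too short to reach the target (only possible for NGx).
    return None, None


# Toy contig lengths (arbitrary units), chosen for easy mental checking.
toy = [5, 4, 3, 2, 1]
print(nx_stats(toy))                  # N50 and L50
print(nx_stats(toy, genome_size=30))  # NG50 and LG50 against a larger genome
```

Because NG50 is anchored to an external genome size rather than to the assembly itself, it allows a fairer comparison between assemblies of different total length.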
Figure 3: Assembly comparisons. (A) The Canu assembly shows a high concordance with the previously published C. parvum assembly (GCA_015245375.1) [27] (dot plots) and agreements in length (bottom). Nevertheless, clear assembly differences are visible when comparing it to GCA_000165345.1 [13]. (B) The Flye assembly vs the C. parvum assembly (GCA_015245375.1) shows large disagreements. Contig 3 is merged between 2 different Cryptosporidium chromosomes, and 1 chromosome is missing. Also, the length comparison (bottom) shows discrepancies in the beginning, highlighting a very short contig in the end (green track). Interestingly, GCA_000165345.1 shows structural differences over both assemblies, likely indicating errors in the previous reference.
Table 1: Overall assembly statistics and comparison using Quast (QUAST, RRID:SCR_001228) between the present assembly and the previously established assembly
Statistic | GCA_000165345.1 | GCA_019844115.1 (newly established) |
---|---|---|
Total sequence length | 9,102,324 | 9,197,619 |
Total ungapped length | 9,087,655 | 9,197,619 |
Unresolved sequences | 14,669 | 0 |
N50 | 1,104,417 | 1,108,772 |
N90 | 985,969 | 993,129 |
L50 | 4 | 4 |
Total No. of chromosomes | 8 | 8 |
We also generated an assembly with Flye assembler [23] (see Methods), which led to a total of seven contigs. However, one contig was only 62,160 bp long (see Fig. 3B). Despite this early warning sign, we compared the two assemblies to identify which one best represented the C. parvum genome using genome alignments and remapping of short reads.
To validate our findings, we first aligned the Canu and Flye assemblies to the previously published C. parvum genome reference [3] using nucmer [28] (v3.23). The nucmer alignments were filtered by “-l 100 -c 500 -maxmatch” for all assemblies following the suggestions from Assemblytics [29], which was used to study the alignment results that were generated (Fig. 3).
The dot plot from a MUMmer (MUMmer, RRID:SCR_018171) alignment analysis indicates that the GCA_015245375.1 [27] and Canu genome assemblies are largely collinear (Fig. 3A). All chromosomes show co-linearity to the previously established assembly for C. parvum. Upon closer inspection small segments that aligned to other chromosomes were shown to be telomeric sequences. Thus, these segments did not indicate inaccurate alignments per se but highlighted their repetitive nature (see below for details on telomere reconstruction). However, when assessing the dot plot generated for the Flye-assembled genome (Fig. 3B), we observed larger disagreements compared to GCA_015245375.1. As previously mentioned, one contig from the Flye assembly was small (62 kb) and judged to be an artifact. More problematic, however, was the merger of two Cryptosporidium chromosomes into contig 3 (Fig. 3B, second to last row in dot plot). A fusion of two chromosomes from Cryptosporidium was also observed on contig 7. Overall, these analyses show that we initially missed one contig (seven instead of the expected eight), which was too small (∼62 kb) to represent a chromosome. Thus, the missing two chromosomes were merged with other chromosomes within two contigs from Flye. When comparing both of our assemblies (Canu and Flye) to the previously established GCA_000165345.1, we saw large structural disagreements on both assembly comparisons (Fig. 3). The differences between GCA_000165345.1 and our de novo assemblies are most likely due to structural faults in GCA_000165345.1.
We further carried out a remapping experiment to identify structural disagreements between the Illumina data (short-read) and the long-read assemblies. We mapped the reads and found structural variants (SVs) based on discordant paired-end reads (see Methods) [30]. We identified a total of ten potential SVs over the remapping based on the Flye assembly. The majority of events were insertions (4) followed by duplications (3) and breakend (BND) (2). However, on closer inspection only two SVs (the two BND) showed a misassembly with a homozygous alternative genotype. All other eight SVs showed a minor allele frequency and are likely consequences of mapping artifacts or heterogeneity of the sequenced population. Next, we assessed the Canu assembly, which showed 9 SVs in total. All of the identified SVs showed a low read support, indicating a low probability of being correctly identified and likely originating from mapping artifacts as the material originates from a pure oocyst (see Methods). This assessment demonstrated that the Canu assembly is the better representation of C. parvum compared with the Flye assembly for this study.
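The reasoning in this remapping step can be sketched in code: an SV call supported by essentially all reads (homozygous alternative) points to an error in the assembly, whereas a call supported by a minority of reads is more likely a mapping artifact or population heterogeneity. The data class, the 0.9 cutoff, and the read counts below are hypothetical choices for illustration; the study itself assessed the Manta VCF manually.

```python
from dataclasses import dataclass


@dataclass
class SVCall:
    sv_type: str      # e.g. "INS", "DUP", "BND"
    ref_support: int  # reads supporting the assembly as-is
    alt_support: int  # reads supporting the variant


def alt_fraction(call):
    """Fraction of informative reads that support the variant allele."""
    total = call.ref_support + call.alt_support
    return call.alt_support / total if total else 0.0


def triage(calls, hom_alt_min=0.9):
    """Split calls into likely misassemblies and minor-allele calls.

    hom_alt_min is an assumed cutoff, not a value from the paper.
    """
    misassembly, minor = [], []
    for call in calls:
        if alt_fraction(call) >= hom_alt_min:
            misassembly.append(call)
        else:
            minor.append(call)
    return misassembly, minor


# Hypothetical read counts, for illustration only.
toy_calls = [SVCall("BND", 0, 40), SVCall("INS", 35, 5), SVCall("DUP", 28, 4)]
likely_errors, minor_calls = triage(toy_calls)
print([c.sv_type for c in likely_errors], [c.sv_type for c in minor_calls])
```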
Establishing Cryptosporidium assembly
The quality of the Canu-generated draft assembly was further improved by 2 rounds of assembly polishing using the short reads (see Methods). After the first round of polishing, the number of corrections was reduced to ∼20 along the entire genome. The 8 largest contigs available in the final polished assemblies were aligned (see Methods) to the previously published C. parvum reference GCA_015245375.1 [13]. The alignment analysis further confirmed that the 8 contigs represent the previously published chromosomes, while the other contigs appear to be repeats at the start or end of the contigs. Our assembled 8 chromosomes complete 14,669 bp of unresolved sequences (i.e., N). Our assembly also showed a GC content (30.11%) similar to the previous version (30.18%), again attesting to the overall quality.
To further assess the completeness of our assembly, we used BUSCO [21] with the coccidia_odb10 lineage set (see Methods). This analysis confirmed the high quality of our assembly, showing 494 (98.4%) complete re-identified genes from a total of 502. All 494 genes had single copies, indicating that the new assembly is error-free. In addition to these single-copy genes, 3 genes were fragmented and 5 genes were missing from the BUSCO run.
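The BUSCO percentages follow directly from the counts given above (494 complete single-copy, 3 fragmented, 5 missing, 502 total). A small sketch of that arithmetic, with a hypothetical helper name:

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the lineage set."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO genes counted")
    counts = {
        "complete_single": single,
        "complete_duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts reported in the text; no duplicated genes were reported.
summary = busco_percentages(single=494, duplicated=0, fragmented=3, missing=5)
print(summary)
```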
A further comparison with the previous reference genome (GCA_015245375.1) [13] revealed a high consistency, with only 4 SVs (1 insertion, 1 deletion, 1 tandem expansion, and 1 tandem contraction) between the two assemblies. This comparison was performed on the basis of the genomic alignment and using Assemblytics [29].
Last we used the Illumina data set to identify single-nucleotide variants (SNVs) with respect to the new assembly (GCA_019844115.1). Supplementary Fig. S2 shows the allele frequency of the passing SNV (see Methods) and indicates that there are no major differences to be observed and also highlights the purity of the utilized material for the assembly process.
Telomere identification
Telomeric ends present on either end of each chromosome were identified in the Canu genome assembly (see Methods). To search for telomeres, we identified matching sequences of “TTTAGG” repeats [31] in our assemblies (see Methods). Telomeric areas were defined as those with ≥100 repeated sequence matches within a region near the start and end of the contigs. Given these conservative thresholds, we identified a total of 13 telomeric regions. For the majority of chromosomes (2, 3, 4, 5, and 6) telomeric regions were identified at both ends of the chromosomes, thus fully representing the chromosomes from telomere to telomere, including the centromere. Telomeres were observed only at the beginning of chromosomes 7 and 1 and at the end of chromosome 8. We further cross-checked the other contigs that were previously filtered out. These highlighted telomeric sequences but could not be placed automatically to the other chromosomes (i.e., chromosomes 1, 7, or 8). Overall, the identification of the telomeric sequences on almost all of the contigs highlights the overall high quality and continuity of our newly established C. parvum genome. The final assembled genome has been deposited at GenBank (accession GCA_019844115.1).
Assessment of subtyping loci
Cryptosporidium spp. are usually typed and characterized widely by using a small set of genetic markers including gp60, COWP, HSP70, and 18S [32]. Most of the genetic marker data available in GenBank were generated from short-read amplification and sequencing by Sanger, thus providing an improved resolution, but still contain errors arising from manual curation.
The gp60 sequence from the present assembly was aligned with reference sequences retrieved from GenBank. Reference sequences selected for alignment consisted of multiple IIa (C. parvum) subtypes, including a IIaA17G2R1 reference (MK165989) corresponding to the sequenced C. parvum isolate in our study. ClustalW alignment was carried out using BioEdit V7.2.5 (BioEdit, RRID:SCR_007361) with no gaps or large mismatches. The assembled genome has 100% identity with the reference genome IIaA17G2R1, and the genetic markers were observed (see Supplementary Fig. S3).
Conclusion
The present work highlights how next-generation sequencing, including third-generation long-read sequencing, can be used to generate a high-quality genome assembly complete with centromeric regions and numerous telomeres. The genome assembly generated provides a gapless reference compared to the previously published GCA_000165345.1 [13] and extends into some telomeric regions over GCA_015245375.1 [27]. Telomeric regions added to those from GCA_000165345.1, which is a hybrid assembly based on two different subtypes of Cryptosporidium spp. (IIaA17G2R1 and IIaA15G2R1), which might affect further comparison or association studies. In contrast, our study was able to boost the fidelity and robustness of the assembly by focusing on one subtype only, IIaA17G2R1, resulting in a better telomere-to-telomere assembly representation (GCA_019844115.1). Studies of Cryptosporidium spp. are based on genetic markers previously identified for some regions of chromosome 6 and are not able to provide a better understanding of the genetic variation and recombination occurring within the species. Thus, establishing stronger marker genes and perhaps enabling improved recovery of Cryptosporidium-specific sequencing reads by mapping to a high-resolution reference genome will enable better understanding of Cryptosporidium transmission.
A commonly used approach for C. parvum subtyping is based on tandem repeat analysis of gp60, a highly polymorphic gene that encodes for an immunodominant glycoprotein (15/40 kDa) located on the surface of sporozoites and merozoites of many Cryptosporidium species [33]. The present study was done using an isolate propagated in calves by Bunch Grass Farms (Deary, ID, USA). The vendor originally propagated C. parvum IOWA II belonging to subtype IIaA15G2R1 based on gp60 sequencing. This strain has now been replaced with a closely related local isolate belonging to the IIaA17G2R1 subtype. In our work, this isolate is referred to as C. parvum (GCA_019844115.1). It is unclear whether the IIaA17G2R1 evolved from IOWA II, possibly from recombination with another local isolate, or whether it represents a distinct isolate on its own. To our knowledge the assembly done here represents the first IIaA17G2R1 subtype isolate for which long-read sequencing has been performed. C. parvum isolates belonging to the IIaA17G2R1 subtype have been identified in farms in various regions of the world [34–36], were the second most common genotype identified in human cases in a recent study done in Canada [37], and are responsible for causing foodborne outbreaks in the USA [38,39].
Published studies have shown the presence of contingency genes in Cryptosporidium spp., which are responsible for surmounting challenges from the host and are subject to spontaneous mutation rates [40–42]. The majority of these genes are located in the telomere regions of the chromosomes, which are prime sites that evolve and mediate host-parasite interactions [31,43]. In the present assembly, we were able to resolve 13 of the possible 16 telomeres. The capacity to resolve telomeres and subtelomeres across chromosomes in Cryptosporidium spp. will lead to a better understanding of the organism's adaptation to a variety of environmental and host settings.
We utilized two de novo assembly approaches here to obtain a better representation for Cryptosporidium spp. and demonstrated two methods for validating these two assemblies. First, we compared the assemblies from Flye and Canu to pre-existing assemblies from Cryptosporidium spp. from different subtypes and were able to identify certain structural differences. Furthermore, the detection of SVs proved helpful in deciding which assembly best represents the species at hand [20]. This was only possible by having orthogonal sequenced Illumina reads. Other studies might choose a different strategy such as utilizing HiC directly, which would also enable a better scaffolding [44]. For Cryptosporidium spp. this was not necessary because the genome is of relatively small size (∼9 Mb) and encompasses eight chromosomes. The analysis of BUSCO is also an important indication of quality (i.e., completeness and redundancy) but did not indicate incorrect rearrangements identified with the Flye assembly. These types of misassemblies can be readily identified only by comparing closely related reference genomes and/or orthogonal data sets (e.g., Illumina short reads).
The final Cryptosporidium spp. assembly will be a helpful resource to advance the study of this important pathogen, further investigate its complexity during growth and development in vitro, and serve as a reference for the study of genetic diversity among different isolates. Furthermore, we hope that it also facilitates translational research that focuses on characterizing virulence, pathogenicity, and host specificity. In this way, new targets may be found leading to vaccines or effective antiparasitic agents to treat this important pathogen.
Methods
DNA extraction
Cryptosporidium parvum oocysts were obtained from Bunchgrass Farm in Deary, ID (Lot No. 22–20, shed date, 10/2/20), and are propagated from IOWA-1 subtype IIaA15G2R1, which was recently replaced by a local isolate subtype IIaA17G2R1 [45]. Purified oocysts (10⁸) were washed in PBS and treated with diluted bleach for 10 minutes on ice to allow for sporozoite excystation. Parasites were pelleted, washed in PBS, and DNA was extracted using Ultrapure™ phenol:chloroform:isoamyl alcohol (Thermo Fisher Scientific, Waltham, MA, USA) followed by ethanol precipitation. Glycoblue™ co-precipitant (Thermo Fisher Scientific, Waltham, MA, USA) was used to facilitate visualization of DNA during extraction and purification steps.
ONT Library preparation and sequencing
NEBNext FFPE DNA Repair Mix (New England Biolabs, Ipswich, MA, USA) was used to repair 620 ng of genomic DNA, which was then followed by end-repair and dA-tailing with NEBNext Ultra II reagents. The dA-tailed insert molecules were further ligated with an ONT adaptor via ligation kit SQK-LSK110 (Oxford Nanopore Technologies, UK). Purification of the library was carried out with AMPure XP beads (Beckman, Cat No. A63880), the final library of 281 ng was loaded to 1 PromethION 24 flow cell (FLO-PRO002, Oxford Nanopore Technologies, UK), and the sequencing data were collected for 24 hours.
Illumina Library preparation and sequencing
DNA (100 ng) was sheared into fragments of ∼300–400 bp in a Covaris E210 system (96-well format, Covaris, Inc., Woburn, MA, USA) followed by purification of the fragmented DNA using AMPure XP beads (Beckman Coulter, Inc. USA. Cat# A63880). DNA end repair, 3′-adenylation, ligation to Illumina multiplexing dual-index adaptors, and ligation-mediated PCR (LM-PCR) were all completed using automated processes. The KAPA HiFi polymerase (KAPA Biosystems Inc. Boston, MA, USA) was used for PCR amplification (10 cycles), which is known to amplify high-GC and low-AT rich regions at greater efficiency. A fragment analyzer (Advanced Analytical Technologies, Inc., Iowa, USA) electrophoresis system was used for library quantification and size estimation. The libraries were 630 bp (including adaptor and barcode), on average. The library was pooled with other internal samples, with adjustment carried out to yield 3 Gb of data on a NovaSeq 6000 S4 flow cell.
Genome size estimation
We used Jellyfish (Jellyfish, RRID:SCR_005491) (version 2.3.0) to generate a k-mer–based histogram of our raw reads to estimate the genome size based on our short-read data. To obtain this we ran Jellyfish [46, 47] with “jellyfish count -C -m 21 -s 1000000000 -t 10” and subsequently the “histo” module with default parameters. The obtained histogram was loaded into GenomeScope [46] given the appropriate parameter (k-mer size of 21) and haploid genome. GenomeScope provided the overall statistics across the short reads.
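To illustrate the idea behind k-mer–based size estimation, the sketch below applies the common back-of-the-envelope rule (total k-mers divided by the depth of the main peak) to a two-column depth/count histogram of the kind written by “jellyfish histo”. This is a deliberately naive stand-in: GenomeScope fits a statistical model rather than using this rule, and the function names, the error cutoff of depth 3, and the toy histogram are assumptions.

```python
def parse_histo(text):
    """Parse 'depth count' lines into a {depth: count} dictionary."""
    histo = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        depth, count = line.split()
        histo[int(depth)] = int(count)
    return histo


def naive_genome_size(histo, min_depth=3):
    """Estimate genome size as total k-mers / peak depth.

    K-mers seen fewer than min_depth times are treated as sequencing
    errors and ignored (an assumed, adjustable cutoff).
    """
    kept = {d: c for d, c in histo.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(d * c for d, c in kept.items())
    return total_kmers / peak_depth


# Toy histogram: an error spike at depth 1-2 and a main peak at depth 10.
toy_histo = parse_histo("1 5000\n2 100\n8 20\n9 60\n10 100\n11 60\n12 20\n")
toy_estimate = naive_genome_size(toy_histo)
print(toy_estimate)
```

On real data the estimate is sensitive to the error cutoff and to repeats, which is one reason a model-based fit is preferable.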
Assembly evaluation
We aligned the assembly of Canu (version 2.0) [22] and Flye (version 2.8.1-b1676) [23] with the 2 Cryptosporidium assemblies GCA_000165345.1 and GCA_015245375.1 using nucmer (version 3.1) -maxmatch -l 100 -c 500 [28]. Next, the delta files were evaluated with Assemblytics [29] (version 1.2.1) using the dot plot function. In addition, we mapped the short Illumina reads using bwa mem [48] (0.7.17-r1188) with default parameters to our new assembly. Subsequently, we identified SVs using Manta [49] (v1.6.0) and assessed the VCF file manually. Manta identifies SVs on the basis of abnormally spaced or oriented paired-end Illumina reads here with respect to our new assembly. We further assessed the Illumina data by identifying SNVs using iVar [50] (version 1.3.1) with default parameters. We summarized the allele frequencies across the reads using a custom bash script for PASS variants only.
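The allele-frequency summary for PASS variants was produced with a custom bash script that is not reproduced here. As a hedged Python equivalent, the sketch below bins alternative-allele frequencies from a tab-separated variant table; it assumes the table has `ALT_FREQ` and `PASS` columns (as in iVar's variant output, with `PASS` holding TRUE/FALSE), and the toy rows are invented for illustration.

```python
import csv
import io


def pass_allele_frequency_histogram(tsv_text, n_bins=10):
    """Count PASS variants per allele-frequency bin of width 1/n_bins."""
    counts = [0] * n_bins
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["PASS"].strip().upper() != "TRUE":
            continue  # keep PASS variants only
        freq = float(row["ALT_FREQ"])
        # A frequency of exactly 1.0 goes into the last bin.
        index = min(int(freq * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Invented example rows; a reduced set of columns is used for brevity.
toy_tsv = (
    "REGION\tPOS\tREF\tALT\tALT_FREQ\tPASS\n"
    "chr1\t100\tA\tG\t0.05\tTRUE\n"
    "chr1\t200\tC\tT\t0.97\tTRUE\n"
    "chr2\t50\tG\tA\t1.0\tTRUE\n"
    "chr2\t75\tT\tC\t0.5\tFALSE\n"
)
toy_bins = pass_allele_frequency_histogram(toy_tsv)
print(toy_bins)
```

For a clonal haploid sample mapped to its own assembly, one would expect few passing variants overall, so the shape of this histogram is a quick check on sample purity.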
Assembly and polishing
We used Canu [22] (v2.0) for the assembly, which was based only on Nanopore pass data and a genome size estimate of 9 Mb. On the Nanopore pass reads, we also ran the assembly using Flye [23] (version 2.8.1-b1676) with the default parameters. Subsequently, we aligned the short reads using bwa-mem (version 0.7.17-r1188) with -M -t 10 parameters. Samtools (SAMTOOLS, RRID:SCR_002105) [51] (v1.9) was used to compress and sort the alignments. The resulting alignment was used by Pilon (Pilon, RRID:SCR_014731) [52] (v 1.24) with the parameters “--fix bases” by correcting one chromosome after another of the raw assembly. This process was repeated 2 times, achieving a high concordance of the reads and the long-read assembly at the second polishing step.
BUSCO assessment
We ran BUSCO [21] (v5.2.2) to assess the completeness of our assembly using the parameters “busco -m geno -l coccidia_odb10 -i” with the coccidia_odb10 lineage set (creation date: 5 August 2020, No. of genomes: 20, No. of BUSCOs: 502). The summary statistics generated by BUSCO are presented under Results.
Telomere identification
We used the sequence “TTTAGGTTTAGGTTTAGG” to identify telomeric sequences at the start and end of every contig from our assembly. To do so we used Bowtie (Bowtie 2, RRID:SCR_016368) [53] (version 1.2.3) to align the telomeric sequence back to the assembly with -a parameter. Subsequently we counted the matches across regions using a custom script. In short, we used 10-kb windows to count the number of reported hits along the genome and compared the locations with the expected start/end locations. The identified regions were filtered for ≥100 hits to guarantee a robust match. This way, we counted the number of times each chromosome was listed.
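The windowed counting described above can be approximated without an aligner. The sketch below counts exact occurrences of the TTTAGG unit (and its reverse complement CCTAAA) in the first and last 10 kb of each contig and applies the ≥100-hit threshold. This is a simplification under stated assumptions: it uses plain string matching of the 6-bp unit instead of Bowtie alignments of the 18-bp query, so hit counts are not directly comparable with those of the original custom script, and the function names and toy contigs are hypothetical.

```python
MOTIF = "TTTAGG"
MOTIF_RC = "CCTAAA"  # reverse complement of TTTAGG


def count_repeat_units(sequence):
    """Count exact telomeric repeat units on either strand."""
    sequence = sequence.upper()
    return sequence.count(MOTIF) + sequence.count(MOTIF_RC)


def telomere_calls(contigs, window=10_000, min_hits=100):
    """Return {name: (start_is_telomeric, end_is_telomeric)} per contig.

    Only the first and last `window` bases are inspected. For contigs
    shorter than `window`, the two windows overlap.
    """
    calls = {}
    for name, seq in contigs.items():
        start_hits = count_repeat_units(seq[:window])
        end_hits = count_repeat_units(seq[-window:])
        calls[name] = (start_hits >= min_hits, end_hits >= min_hits)
    return calls


# Toy contigs: one with repeats at both ends, one with none.
toy_contigs = {
    "both_ends": "CCTAAA" * 150 + "ACGT" * 6000 + "TTTAGG" * 120,
    "no_telomere": "ACGT" * 6000,
}
toy_calls = telomere_calls(toy_contigs)
print(toy_calls)
```

Summing the True values over all chromosome-scale contigs gives the number of resolved telomeric ends, which is the quantity reported in Results.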
Regional comparison
Genetic marker gp60 was used to subtype the assembled genome against available GenBank reference genomes for C. parvum. Representative reference genomes for C. parvum were downloaded from GenBank and were aligned using ClustalW (ClustalW, RRID:SCR_017277) [54] (BioEdit V7.2.5) against the present assembly. Further analysis of the gp60 gene sequence for tandem repeats to determine subtype designation was done following the methods of Alves et al. [55]
Data Availability
The genome assembly is available in the NCBI repository and can be accessed with BioProject PRJNA744539 (GCA_019844115.1). All supporting data and materials are available in the GigaScience GigaDB database [56].
Additional Files
Supplemental Figure S1. Genomescope estimation of genome size.
Supplemental Figure S2. ClustalW alignment of the gp60 coding sequence with the assembly.
Abbreviations
bp: base pairs; BUSCO: Benchmarking Universal Single-Copy Orthologs; Gb: gigabase pairs; Mb: megabase pairs; NCBI: National Center for Biotechnology Information; ONT: Oxford Nanopore Technologies; PBS: phosphate-buffered saline; SNV: single-nucleotide variant; SV: structural variant.
Conflict of Interests
F.J.S. has presented at both ONT- and Pacific Biosciences–sponsored conferences. The authors declare that they have no other competing interests.
Funding
This work was supported by the National Institute of Allergy and Infectious Diseases (Grant No. 1U19AI144297).
Authors’ Contributions
F.J.S. and V.K.M.: Conceptualization, Analysis, and Writing—Original Draft Preparation
C.L.C. and G.A.M.: Conceptualization and Writing—Review & Editing
P.C.O.: Conceptualization, Resources, and Writing—Review & Editing
H.D., Q.M., and D.M.M.: Conceptualization and Writing—Review & Editing
S.S., S.B., K.K., G.W., H.S., V.V., and Y.H.: Methodology and Investigation
M.C.R., K.L.H., and S.J.C.: Conceptualization
M.M. and M.M.: Analysis
R.A.G. and J.F.P.: Conceptualization, Funding Acquisition