Fully resolved assembly of Cryptosporidium parvum

Abstract Background Cryptosporidium parvum is an apicomplexan parasite commonly found across many host species with a global infection prevalence in human populations of 7.6%. Understanding its diversity and genomic makeup can help in fighting established infections and prohibiting further transmission. The basis of every genomic study is a high-quality reference genome that has continuity and completeness, thus enabling comprehensive comparative studies. Findings Here, we provide a highly accurate and complete reference genome of Cryptosporidium parvum. The assembly is based on Oxford Nanopore reads and was improved using Illumina reads for error correction. We also outline how to evaluate and choose from different assembly methods based on 2 main approaches that can be applied to other Cryptosporidium species. The assembly encompasses 8 chromosomes and includes 13 telomeres that were resolved. Overall, the assembly shows a high completion rate with 98.4% single-copy BUSCO genes. Conclusions This high-quality reference genome of a zoonotic IIaA17G2R1 C. parvum subtype isolate provides the basis for subsequent comparative genomic studies across the Cryptosporidium clade. This will enable improved understanding of diversity, functional, and association studies.

Cryptosporidium parvum is an apicomplexan parasite commonly found across many host 37 species with a global infection prevalence in human populations of 7.6%. As such, it is 38 important to understand the diversity and genomic makeup of this prevalent parasite to 39 fight established infections and prohibit further transmission. The basis of every genomic 40 study is a high quality reference genome that has continuity and completeness, thus 41 enabling comprehensive comparative studies. 42 Findings 43 Here, we provide a highly accurate and complete reference genome of Cryptosporidium 44 parvum. The assembly is based on Oxford Nanopore reads and was improved using 45 Illumina reads for error correction. We also outline how to evaluate and choose from 46 different assembly methods based on two main approaches that can be applied to other 47 Cryptosporidium species. The assembly encompasses 8 chromosomes and includes 13 48 telomeres that were resolved. Overall, the assembly shows a high completion rate with 49 98.4% single copy BUSCO genes. 50 51 Conclusions 52 This high quality reference genome of a zoonotic IIaA17G2R1 C. parvum subtype 53 isolate provides the basis for subsequent comparative genomic studies across the Introduction 59 Cryptosporidium is an apicomplexan parasite of public health and veterinary significance 60 with a recent analysis reporting a global infection prevalence of 7.6% [1]. Historically, 61 limited government and private funding was available to study the epidemiology and 62 molecular dynamics of the organism, but this has recently shifted [2]. 63 Cryptosporidium spp. have been found in 155 species of mammals, including primates 64 [3,4]. Among humans, twenty species of Cryptosporidium spp. have been identified [5]. 65 Although the parasite can be transmitted in a variety of ways, the most common method 66 is via drinking and recreational waters. In the United States, Cryptosporidium is the most 67 common cause of waterborne disease in humans [6]. Studies have shown that 68 Cryptosporidium is responsible for a large proportion of all cases of moderate-to-severe 69 diarrhea in children under the age of two [7,8]. There is currently no vaccine available, 70 and the only approved drug for the treatment of Cryptosporidium- In the current study, we have generated a reference genome for C. parvum by using long-100 read sequencing on the ONT PromethION (PromethION, RRID:SCR_017987) 101 supplemented with short-read data generated on NovaSeq 6000 (Illumina NovaSeq 6000  102 Sequencing System, RRID:SCR_016387) for error correction (see Figure 1). This 103 resulted in a complete reference including all chromosomes and thus represents a gap-104 less representation of this important pathogen. Furthermore, it includes 13 of 16 telomeric 105 sequences. The assembly is available at PRJNA744539 (GCA_019844115.1). In addition 106 to the novel assembly, we lay out our QC process and assessment of the assembly to 107 optimize not only for length but also to assess the overall structure of the draft assemblies. 108 Following this comparison schema, it is easy to choose the most optimal representation. 109 In addition, this schema is applicable for other species as well, from single haploid to 110 more complex organisms like plants or humans. 111 112 Results 113 114 Figure 1: Workflow for the generation of Cryptosporidium parvum assembly. 115 116 We sequenced the C. parvum genome with Oxford Nanopore long-reads (see methods) 117 and obtained a total of ~480Mbp of sequence (Figure 1). This is equivalent to 53x 118 coverage for this genome (~9Mbp genome size). Figure 2 shows overall statistics on 119 read length and coverage. The N50 read length is 15.3 kbp with 10x coverage of reads 120 with ≥30kbp length. Our longest read detected was 808 kbp. In addition, we sequenced 121 the genome using the Illumina NovaSeq 6000 to produce 352x coverage of 150 bp 122 paired end reads. 123 124 Using these short-reads we ran a genome estimation using GenomeScope 125 (GenomeScope, RRID:SCR_017014) [26] to obtain a genome size estimate using a 126 polyploidy of 1. Doing so resulted in an estimate of 9.9Mbp with an 89.24% model fit 127 (see Supplementary Figure 1). Inspection of the resulting data ( Figure 2) highlights 128 that this is a potential overestimation of the genome size itself and thus fits in the realm 129 of the previously reported reference assembly in CryptoDB (GCA_015245375) of 130 ~9.1Mbp. 131 132 Figure 2: Read length distribution and cumulative coverage over the Oxford Nanopore. 133 Sequencing. We obtained a total of 53x coverage with long-reads and even 10x 134 coverage with reads larger than 30kbp (x axis alignments were filtered by "-l 100 -c 500 -maxmatch" for all assemblies following the 164 suggestions from Assemblytics [29], which was used to study the alignment results that 165 were generated ( Figure 3). 166 The dot plot from a MUMmer (MUMmer, RRID:SCR_018171) alignment analysis 167 indicates that the GCA_015245375. 1 [27] and Canu genome assemblies are largely 168 collinear ( Figure 3A). All chromosomes show co-linearity to the previously established 169 assembly for C. parvum. Upon closer inspection small segments that aligned to other 170 chromosomes were shown to be telomeric sequences. Thus, these segments did not 171 indicate inaccurate alignments per se, but highlighted their repetitive nature (see below 172 for details on telomere reconstruction). However, when assessing the dot plot generated 173 for the Flye assembled genome (Figure 3B), we observed larger disagreements 174 compared to GCA_015245375.1. As previously mentioned, one contig from the Flye 175 assembly was small (62 kbp) and judged to be an artifact. More problematic, however, 176 was the merger of two Cryptosporidium chromosomes into contig3 ( Figure 3B, second 177 to last row in dotplot). A fusion of two chromosomes from Cryptosporidium was also 178 observed on contig_7. Overall, these analyses show that while we initially missed one 179 contig (7 instead of the expected 8), which was too small (~62kbp) to represent a 180 chromosome. Thus, the missing two chromosomes were merged with other 181 chromosomes within two contigs from Flye. When comparing both of our assemblies 182 (Canu and Flye) to the previously established GCA_000165345.1, we saw large 183 structural disagreements on both assembly comparisons ( Figure 3A/B). The 184 differences between GCA_000165345.1 and our de novo assemblies are most likely 185 due to structural faults in GCA_000165345.1. 186 We further carried out a remapping experiment to identify structural disagreements 187 between the Illumina data (short-read) and the long-read assemblies. We mapped the 188 reads and found structural variants (SVs) based on discordant paired end reads (see 189 methods) [30]. We identified a total of 10 potential SVs over the remapping based on the 190 Flye assembly. The majority of events were insertions (4) followed by duplications (3)  191 and breakend (BND) (2). However, on closer inspection only two SVs (the two BND) 192 showed a misassembly with a homozygous alternative genotype. All other eight SVs 193 showed a minor allele frequency and are likely consequences of mapping artifacts or 194 heterogeneity of the sequenced population. Next, we assessed the Canu assembly, 195 which showed nine SVs in total. All of the identified SVs showed a low read support, 196 indicating a low probability of being correctly identified and likely originating from 197 mapping artifacts as the material originates from a pure oocyst (see methods). This 198 assessment demonstrated that the Canu assembly is the better representation of C. 199 parvum compared to Flye assembly for this study. 200 Establishing Cryptosporidium Assembly 201 The quality of the Canu generated draft assembly was further improved by two rounds 202 of assembly polishing employing the short-reads (see methods Lastly we used the Illumina data set to identify SNV with respect to the new assembly 231 (GCA_019844115.1). Supplementary Figure 2 shows the allele frequency of the passing 232 SNV (see methods) and indicates that there are no major differences to be observed and 233 also highlights the purity of the utilized material for the assembly process. 234 235 Telomere identification 236 Telomeric ends present on either end of each chromosome were identified in the Canu 237 genome assembly (see methods). To search for telomeres, we identified matching 238 sequences of "TTTAGG" repeats [31] in our assemblies (see methods). Telomeric areas 239 were defined as those with at least 100 repeated sequence matches within a region 240 near the start and end of the contigs. Given these conservative thresholds, we identified 241 a total of 13 telomeric regions. For the majority of chromosomes (2,3,4,5 and 6) 242 telomeric regions were identified at both ends of the chromosomes, thus fully 243 representing the chromosomes from telomere to telomere, including the centromere. 244 telomeres were only observed at the beginning of chromosomes 7, 1 and at the end of 245 chromosome 8. We further-crossed checked the other contigs that were previously 246 filtered out. These highlighted telomeric sequences, but couldn't be placed automatically 247 to the other chromosomes (i.e, chromosomes 1, 7 or 8 Subsequently, we aligned the short reads using bwa-mem (version 0.7.17-r1188) with -393 M -t 10 parameters. Samtools (SAMTOOLS, RRID:SCR_002105) [51] (v1.9) was used 394 to compress and sort the alignments. The so generated alignment was used by Pilon 395 (Pilon, RRID:SCR_014731) [52] (v 1.24) with the parameters "--fix bases "by correcting 396 one chromosome after another of the raw assembly. This process was repeated two 397 times achieving a high concordance of the reads and the long-read assembly at the 2nd 398 polishing step. 399 400 BUSCO assessment 401 We ran BUSCO [21] (v5.2.2) to assess the completeness of our assembly using the 402 parameter "busco-m geno-l coccidia_odb10 -i" , coccidia_odb10 (Creation date: 2020-403 08-05, number of genomes: 20, number of BUSCOs: 502). The summary statistics 404 generated by Busco are presented under results. 405 406 Telomere Identification 407 We used the sequence "TTTAGGTTTAGGTTTAGG" to identify telomeric sequences at 408 the start and end of every contig from our assembly. To do so we used Bowtie (Bowtie 409 2, RRID:SCR_016368) [53] (version 1.2.3) to align the telomeric sequence back to the 410 assembly with -a parameter. Subsequently we counted the matches across regions 411 using a custom script. In short, we used 10kbp windows to count the number of reported 412 hits, align the genome and compare the locations with the expected start/end locations. 413 The identified regions were filtered for at least 100 hits to guarantee a robust match. 414 This way, we counted the number of times each chromosome was listed. 415 416 Regional comparison 417 Genetic marker gp60 was used to subtype the assembled genome against available 418 GenBank We would like to thank the reviewers and editor for this positive feedback of our work. To address the concerns, we improved the description of the method section and certain subparts of the main text. We have also addressed the questions from the reviewers directly below. To highlight the responses we marked them in blue. Furthermore we highlighted the large changes in the main text also in blue.

Reviewer #2:
Manuscript Number: GIGA-D-21-00321 Fully resolved assembly of Cryptosporidium parvum Reviewer comments to authors. Overall, Menon et al. present a significant contribution to the field with this work. Their fully resolved assembly of Cryptosporidium parvum is the first to my knowledge to utilize long read sequencing in whole genome sequencing for this group of protozoan parasites and as such provides validation of previously published work while also improving on current reference standards and providing a robust and well described analysis pipeline for future studies.
In my view, there are only a couple of issues with the paper that should be addressed. The first is a discussion of recent work using metabarcoding (e.g. DOI10.1016/j.meegid.2012.08.017, DOI10.1016/j.ijpara.2017.03.003), which demonstrates mixed infections in clinical samples of patients infected with Cryptosporidium which were missed with consensus Sanger sequencing. In some cases, mixtures of subtype families can be found, though dominance of a single subtype with a few closely related variants is more common and more likely in the current paper. Nonetheless, this may have implications for sequencing since purity of the "culture" cannot be guaranteed and results from the lack of reliable in vitro culture methods for Cryptosporidium.
We would like to thank the reviewer for his / her time and constructive response. In this particular study we started from a pure oocyst DNA and thus the work around metabarcoding is not applicable here. As we understand, metabarcoding would be more appropriate if we would start from a stool /clinical sample where there is a larger genetic diversity present. We have added a sentence in the methods to make this clear.
The second issue I have is with the section on comparative genomics. Strictly speaking calling this a comparative genomics analysis is not correct since the authors do not compare genomes with genomes. Instead, it is based on comparison with a small subset of sanger generated sequences and does not add much to the paper in my view. If it is to be included, the text should be rephrased to better reflect the analyses and the identity (species, subtype, subtype family) of the sequences downloaded from genbank should be presented in more detail. Also, it is unclear what criteria were used to select these sequences from among the many hundreds available for C. parvum and this should be stated too.
We thank the reviewer for pointing this out. We have now renamed the section "Assessment of subtyping loci", which hopefully better reflects our indications. We have rephrased this section to reflect and provide more information on the analysis. The supplementary figures have also been