Annotated draft genome sequences of three species of Cryptosporidium: Cryptosporidium meleagridis isolate UKMEL1, C. baileyi isolate TAMU-09Q1 and C. hominis isolates TU502_2012 and UKH1

Human cryptosporidiosis is caused primarily by Cryptosporidium hominis, C. parvum and C. meleagridis. To accelerate research on parasites in the genus Cryptosporidium, we generated annotated, draft genome sequences of human C. hominis isolates TU502_2012 and UKH1, C. meleagridis UKMEL1, also isolated from a human patient, and the avian parasite C. baileyi TAMU-09Q1. The annotation of the genome sequences relied in part on RNAseq data generated from the oocyst stage of both C. hominis and C. baileyi. The genome assembly of C. hominis is significantly more complete and less fragmented than that available previously, which enabled the generation of a much-improved gene set for this species, with an increase in average gene length of 500 bp relative to the protein-encoding genes in the 2004 C. hominis annotation. Our results reveal that the genomes of C. hominis and C. parvum are very similar in both gene density and average gene length. These data should prove a valuable resource for the Cryptosporidium research community.

Keywords: Cryptosporidium; C. hominis TU502_2012; Cryptosporidium meleagridis; Cryptosporidium baileyi; genome assembly; annotation
Cryptosporidium parasites (Phylum: Apicomplexa) infect a wide range of vertebrates, from fish to humans, and are the causative agents of cryptosporidiosis in humans (Upton and Current 1985; Tzipori 1988; Widmer and Sullivan 2012). A recent, large, multicenter study of the etiology of moderate-to-severe diarrhea (MSD) in infants in the developing world found Cryptosporidium hominis to be among the four predominant pathogens associated with MSD in children under 5 years of age (Kotloff et al. 2013). Some Cryptosporidium species are capable of zoonotic transmission (Ryan, Fayer and Xiao 2014). Comparative analysis of genomes from diverse Cryptosporidium species and related protists is essential to fully understand the biology, pathology, host specificity and evolution of this genus.
The reference C. parvum IOWA II genome (Abrahamsen et al. 2004) is essentially complete, with its eight chromosomes distributed among 18 contigs, including full-length chromosomes. In contrast, the reference assembly of C. hominis, based on isolate TU502, published in 2004 (Xu et al. 2004), is a highly fragmented draft genome consisting of 1422 contigs. To accelerate research on these pathogens of public health and veterinary significance, we sequenced, assembled and annotated four Cryptosporidium genome sequences belonging to three species as part of a community White Paper undertaking. Two sequences were generated from a species infective to humans, C. hominis isolates TU502_2012 and UKH1. In addition, sequences were generated from the generalist species C. meleagridis, isolate UKMEL1, and from the TAMU-09Q1 isolate of C. baileyi, an avian-infecting parasite. All three species are enteric parasites. Cryptosporidium baileyi can complete its entire life cycle in embryonated chicken eggs, making it a useful laboratory model to address some aspects of Cryptosporidium biology. Cryptosporidium meleagridis appears to lack host specificity, as it is known to infect both avian and mammalian species (Akiyoshi et al. 2003).
Cryptosporidium hominis UKH1 and C. meleagridis UKMEL1 oocysts were isolated from fecal samples of naturally infected humans. Cryptosporidium meleagridis oocysts were propagated in immunosuppressed adult CD-1 mice, and C. hominis UKH1 in neonatal gnotobiotic pigs. Cryptosporidium hominis TU502_2012 originates from the C. hominis TU502 isolate maintained by serial propagation in gnotobiotic pigs (Tzipori et al. 1994; Xu et al. 2004). Cryptosporidium baileyi oocysts were extracted from experimentally infected embryonated chicken eggs. Prior to isolating DNA, extracted oocysts were purified on density gradients (Widmer, Feng and Tanriverdi 2004) and surface-sterilized with bleach to minimize contamination with host and bacterial DNA. RNA samples were obtained from C. hominis TU502_2012 and C. baileyi TAMU-10GZ1 oocysts <4 months old, and sequenced to high coverage using strand-specific RNAseq (Parkhomchuk et al. 2009). De novo assembly of the genomic reads was performed using MaSuRCA version 1.9 (Zimin et al. 2013) (Table 1).
All the genomes except C. hominis UKH1 were annotated using a semi-automated approach. We trained Augustus (Stanke et al. 2004) using a set of previously manually curated genes. The consensus predictor EVidence Modeler, EVM (Haas et al. 2008), was used to generate annotations based on predictions from Augustus and GeneMark-ES (Borodovsky and Lomsadze 2011), transcripts assembled from RNAseq reads and matches to a set of highly conserved eukaryotic genes, the Core Eukaryotic Genes Mapping Approach genes (Parra, Bradnam and Korf 2007). In addition, 394 genes (∼10% of all genes) in the C. hominis TU502_2012 genome were manually annotated using Web Apollo (Lee et al. 2013). The manually curated genes are thought to encode antigens (Ifeonu et al., in preparation). The C. hominis TU502_2012 genes were mapped to the C. hominis UKH1 assembly using GMAP (v2015-12-31), and the matches were filtered to retain only those that extend over at least 95% of the sequence and have ≥95% alignment identity at the amino acid level. The final assembly attributes are listed in Table 1. This Whole Genome Shotgun project has been deposited in DDBJ/EMBL/GenBank under the accession numbers listed in Table 1 and the sequences are accessible at CryptoDB (http://CryptoDB.org). These are the first versions of genome sequence assemblies and annotations for each isolate.
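The coverage and identity filter applied to the mapped genes can be illustrated with a minimal sketch. It assumes that each match has already been reduced to three counts (query length, aligned length and identical positions); the `Alignment` record and both function names are hypothetical and are not part of GMAP or of the pipeline used here.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one gene-to-assembly match."""
    gene_id: str
    query_length: int    # length of the mapped gene sequence
    aligned_length: int  # query positions covered by the match
    matches: int         # identical positions within the aligned region


def passes_filter(aln, min_coverage=0.95, min_identity=0.95):
    """Keep a match only if it spans enough of the query and is similar enough."""
    if aln.query_length == 0 or aln.aligned_length == 0:
        return False
    coverage = aln.aligned_length / aln.query_length
    identity = aln.matches / aln.aligned_length
    return coverage >= min_coverage and identity >= min_identity


def filter_alignments(alignments, min_coverage=0.95, min_identity=0.95):
    """Return the matches that satisfy both thresholds, in input order."""
    return [a for a in alignments
            if passes_filter(a, min_coverage, min_identity)]


if __name__ == "__main__":
    # Toy values chosen for illustration only; they are not data from this study.
    example = [
        Alignment("geneA", 1000, 980, 970),   # passes both thresholds
        Alignment("geneB", 1000, 900, 900),   # coverage too low
        Alignment("geneC", 1000, 1000, 940),  # identity too low
    ]
    print([a.gene_id for a in filter_alignments(example)])
```

Both thresholds default to the 95% values stated in the text; how coverage and identity are derived from the actual mapper output is left open in this sketch.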
The genome of C. hominis isolate TU502 has been sequenced previously (Xu et al. 2004). We resequenced the genome of this isolate, after multiple passages, in an attempt to improve the reference genome assembly and gene set for this species. The resulting C. hominis TU502_2012 genome assembly consists of only 119 contigs, a more than 10-fold reduction relative to the 1422 contigs of the 2004 assembly. The genome assembly is now more complete, and roughly the same size as that of C. parvum, which is also 9.1 Mbp in length (Abrahamsen et al. 2004). The genes in the new annotation are on average 500 bp longer than their counterparts in the original 2004 annotation, resulting in an increase of 17% in the fraction of the genome that encodes proteins. To determine whether this gene structural annotation is more accurate than the one published in 2004, we compared the length of all C. parvum IOWA II proteins with that of their orthologs in either C. hominis TU502 or C. hominis TU502_2012. The distribution of length differences based on the comparison to the 2012 reannotation indeed has lower variance, with an additional 500 genes similar in length between the two species (Fig. 1). Also, there are 538 C. parvum genes without orthologs in the C. hominis TU502 2004 annotation compared to only 288 such cases in the 2012 annotation. Interestingly, while the original C. hominis annotation had a preponderance of genes shorter than their C. parvum orthologs, the current gene set is skewed in the opposite direction (Fig. 1). Whether this difference is real, or a result of remaining gene structure errors in one or both species, remains to be determined. The C. hominis TU502_2012 annotation contains 206 predicted protein-coding genes with no orthologs in C. parvum IOWA II. Of the 3745 predicted protein-coding genes in C. hominis TU502_2012, only 63% are also found in all other annotated Cryptosporidium genomes available to date: C. parvum IOWA II, C. meleagridis UKMEL1, C. baileyi TAMU-09Q1 and C. muris RN66 (Fig. 1). Finally, 110 predicted protein-coding genes are present in the three newly sequenced genomes, but homologs are absent in the current C. parvum predicted proteome. These significant differences in gene content among species are, in all likelihood, due mostly to the limitations of the semi-automated annotation approach used, rather than to true instances of gene gain/loss. An intense, manual curation effort of the genome annotation of each species is ongoing, and will be essential to validate these results.
SNPs were categorized as coding and non-coding, given the assembly and the annotation, using VCFtools.
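The ortholog length comparison described above can be sketched as follows. This is a minimal illustration that assumes protein lengths and ortholog assignments are already available as dictionaries. The function names and the 10-residue tolerance used to call two proteins "similar in length" are hypothetical choices, not values taken from this study.

```python
from statistics import pvariance


def length_differences(ref_lengths, query_lengths, orthologs):
    """Compare each reference protein with its ortholog in another annotation.

    ref_lengths:   {reference protein ID: length}
    query_lengths: {query protein ID: length}
    orthologs:     {reference protein ID: query protein ID}
    Returns (list of query-minus-reference length differences,
             number of reference proteins without an ortholog).
    """
    diffs = []
    without_ortholog = 0
    for ref_id, ref_len in ref_lengths.items():
        query_id = orthologs.get(ref_id)
        if query_id is None or query_id not in query_lengths:
            without_ortholog += 1
            continue
        diffs.append(query_lengths[query_id] - ref_len)
    return diffs, without_ortholog


def summarize(diffs, tolerance=10):
    """Summarize a length-difference distribution.

    `tolerance` (in residues) is an assumed cutoff for 'similar in length'.
    """
    return {
        "n_pairs": len(diffs),
        "variance": pvariance(diffs) if diffs else 0.0,
        "n_similar": sum(1 for d in diffs if abs(d) <= tolerance),
        "n_shorter": sum(1 for d in diffs if d < -tolerance),
        "n_longer": sum(1 for d in diffs if d > tolerance),
    }
```

Running `summarize` once per annotation (2004 and 2012) against the same reference proteome would give the quantities compared in the text: the variance of the distribution, the number of similar-length pairs, and the direction of any skew toward shorter or longer genes.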
Genetic differences among C. hominis isolates were identified by read mapping, followed by calling and filtering of single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels). A total of 10 526 sequence variants were identified in C. hominis TU502_2012 relative to the reference C. hominis TU502 assembly; in contrast, only 4394 sequence variants were found between C. hominis UKH1 and the reference C. hominis. Interestingly, the vast majority of the differences relative to the reference TU502 genome are shared between the two new isolates (Fig. 1). A plausible explanation, which remains to be verified, is that these SNPs common to both new isolates are in fact sequencing errors in the original C. hominis TU502 assembly, which was based on low-coverage Sanger sequencing. This, however, does not explain the fact that C. hominis TU502_2012 has more differences relative to TU502 than does UKH1. It is possible that during the approximately 20 passages in gnotobiotic pigs that the C. hominis TU502_2012 isolate underwent between 2004 and 2012, the make-up of the parasite population has shifted. In the absence of methods for cloning and expanding single Cryptosporidium sporozoites, the isolates sequenced to date are likely to be heterogeneous populations (Grinberg and Widmer 2016). In fact, high-throughput sequencing of a polymorphic locus demonstrated the presence of multiple alleles in laboratory and natural Cryptosporidium isolates (Widmer et al. 2015).
We generated RNAseq data for two of the species, C. hominis and C. baileyi. These data are strand specific, a tremendous advantage when attempting to generate accurate gene-specific expression values in highly gene-dense genomes, where neighboring transcriptional units often overlap (Tretina, Pelle and Silva 2016). The quantity of RNAseq data generated for C. hominis UKH1 was six times that generated for the TU502_2012 isolate (Table 1). Despite this difference, the relative expression values for each gene are remarkably similar for the two isolates (r² ∼ 0.96; Fig. 2), which supports the robustness of the relative expression results. The RNAseq data generated from oocysts indicate that ∼50% and ∼60% of protein-coding genes are expressed in C. hominis TU502_2012 and C. baileyi, respectively, during this stage of the life cycle (Table 1). Gene expression is also positively correlated between species (r² ∼ 0.51; Fig. 2), with lactate/malate dehydrogenase (LDH), a GDP-fucose transporter, agrin and the ubiquitous heat shock protein 90 (HSP90) being among the most highly expressed genes in both species. LDH and HSP90 have been shown to be among the top nine most highly expressed genes in C. parvum oocysts (Zhang et al. 2012). Genes preferentially expressed in one or the other species may provide a good starting point to investigate biological differences between taxa. Among the genes that differ most in expression level between the two species are pyridine nucleotide-disulphide oxidoreductase, which has a higher level of expression in C. hominis, and AhpC/TSA family protein, WD repeat-containing protein 82 and DNA mismatch repair protein msh-2, all of which have higher expression levels in C. baileyi.
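The expression correlations reported above are squared correlation coefficients computed over genes shared between two samples. The sketch below shows one way such a value could be obtained, assuming per-gene relative expression values keyed by a common gene or ortholog identifier. The function names are hypothetical, and any transformation applied to the values beforehand (for example, a log scale) is not specified here.

```python
def pearson_r_squared(x, y):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two paired values are required")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


def expression_r_squared(expr_a, expr_b):
    """r^2 over the genes present in both {gene ID: expression value} dicts."""
    shared = sorted(set(expr_a) & set(expr_b))
    return pearson_r_squared([expr_a[g] for g in shared],
                             [expr_b[g] for g in shared])
```

Restricting the calculation to shared identifiers matters for the between-species comparison, where only genes with an ortholog in both annotations can be paired.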
Work on the Cryptosporidium genomes and their respective annotations is continuing, with particular emphasis on the manual curation of the structure and function of all protein-coding genes. Together with the identification of genes unique to each species and genes with species-specific expression profiles, this work will facilitate the identification of genes responsible for host specificity and other phenotypes relevant to the understanding of cryptosporidiosis.