High-quality Schistosoma haematobium genome achieved by single-molecule and long-range sequencing

Abstract Background Schistosoma haematobium causes urogenital schistosomiasis, a neglected tropical disease affecting >100 million people worldwide. Chronic infection with this parasitic trematode can lead to urogenital conditions including female genital schistosomiasis and bladder cancer. At the molecular level, little is known about this blood fluke and the pathogenesis of the disease that it causes. To support molecular studies of this carcinogenic worm, we reported a draft genome for S. haematobium in 2012. Although a useful resource, its utility has been somewhat limited by its fragmentation. Findings Here, we systematically enhanced the draft genome of S. haematobium using a single-molecule and long-range DNA-sequencing approach. We achieved a major improvement in the accuracy and contiguity of the genome assembly, making it superior or comparable to assemblies for other schistosome species. We transferred curated gene models to this assembly and, using enhanced gene annotation pipelines, inferred a gene set with as many or more complete gene models as those of other well-studied schistosomes. Using conserved, single-copy orthologs, we assessed the phylogenetic position of S. haematobium in relation to other parasitic flatworms for which draft genomes were available. Conclusions We report a substantially enhanced genomic resource that represents a solid foundation for molecular research on S. haematobium and is poised to better underpin population and functional genomic investigations and to accelerate the search for new disease interventions.


Background
Schistosoma haematobium causes urogenital schistosomiasis, a neglected tropical disease affecting > 100 million people worldwide. Chronic infection with this parasitic trematode can lead to urogenital pathology including female genital schistosomiasis (FGS) and bladder cancer. At the molecular level, little is known about the biology of this blood fluke and the pathogenesis of the disease that it causes. To support molecular studies of this carcinogenic worm, we reported a draft genome for S. haematobium in 2012. Although a useful resource, the utility of this draft genome has been somewhat limited by its fragmentation.

Findings
Here, we systematically enhanced the draft genome of S. haematobium using a single-molecule and long-range DNA sequencing approach. We achieved a major improvement in the accuracy and contiguity of the genome assembly, making it superior or comparable to assemblies for other schistosome species. Using improved gene annotation pipelines, we inferred a gene set with as many or more complete gene models compared with those of other well-studied schistosomes. Employing conserved, single-copy orthologs, we assessed the phylogenetic position of S. haematobium in relation to other parasitic flatworms for which draft genomes were available.

Conclusions
We report a substantially enhanced genomic resource that represents a solid foundation for molecular research on S. haematobium and is poised to better underpin population and functional genomic investigations, and to accelerate the search for new disease interventions.

[...] worldwide and resulting in > 300,000 deaths each year [1]. Schistosoma haematobium (mainly in Africa; Fig. 1 [...]

[...] iteratively identify and break mis-assemblies and re-scaffold contigs using an established method [32].

A comparison of the number of gaps in the portion of the Shae.V2 assembly representing the S. mansoni chromosomes (Fig. 3) showed that the improved S. haematobium assembly contained fewer gaps (n = 3128) than the S. mansoni genome assembly (n = 5861), representing [...]

[...] cases, two or more gene models in Shae.V1 were merged into a single gene model for Shae.V2.

In contrast, 76 gene models in Shae.V1 were split into multiple models, representing a total of 178 genes in Shae.V2.
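The merged and split categories described above can be illustrated with a minimal sketch. It assumes that both annotation versions are expressed in the same coordinate system and that relationships are judged by simple coordinate overlap, ignoring strand; this is not necessarily how the gene models were compared in this study. All identifiers and coordinates are hypothetical.

```python
from collections import defaultdict


def overlaps(a, b):
    """Return True if two (scaffold, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def classify_models(old_models, new_models):
    """Relate old gene models to new ones by coordinate overlap.

    Both arguments are dicts: gene ID -> (scaffold, start, end), with
    coordinates on the same assembly (an assumption of this sketch).
    Returns (merged, split):
      merged: new ID -> sorted list of >= 2 old IDs it overlaps
      split:  old ID -> sorted list of >= 2 new IDs it overlaps
    """
    old_to_new = defaultdict(set)
    new_to_old = defaultdict(set)
    for old_id, old_iv in old_models.items():
        for new_id, new_iv in new_models.items():
            if overlaps(old_iv, new_iv):
                old_to_new[old_id].add(new_id)
                new_to_old[new_id].add(old_id)
    merged = {n: sorted(o) for n, o in new_to_old.items() if len(o) >= 2}
    split = {o: sorted(n) for o, n in old_to_new.items() if len(n) >= 2}
    return merged, split


# Toy data with hypothetical identifiers and coordinates.
old = {
    "v1_g1": ("scaf1", 100, 500),
    "v1_g2": ("scaf1", 600, 900),
    "v1_g3": ("scaf2", 100, 2000),
}
new = {
    "v2_gA": ("scaf1", 100, 900),    # spans v1_g1 and v1_g2
    "v2_gB": ("scaf2", 100, 800),    # first part of v1_g3
    "v2_gC": ("scaf2", 1200, 2000),  # second part of v1_g3
}
merged, split = classify_models(old, new)
print(merged)
print(split)
```

The all-against-all comparison is quadratic and only suitable for small examples; a real comparison would also need to handle many-to-many cases, where a model is both merged and split.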

The level of completeness of the Shae.V2 gene set was determined by assessing the presence of 978 BUSCO genes both in the genome (Fig. 4A, B; Table 2) and in the gene set (Fig. 4C, D; [...]

[...] short-read data, to achieve a substantially enhanced genome assembly for S. haematobium that is comparable or even superior to those for related schistosome species (Figs 1 and 3 [...]

[...] led to hundreds of merged or discarded gene models and, overall, to a reduced number of predicted genes.

For the most recent S. mansoni gene set (WBPS11), both the average length of genes (21,785 bp) and number of genes (n = 10,131) are higher than for Shae.V2, suggesting a more complete assembly and gene set. However, the length distribution of genes is comparable between the two species, and contrasts with that for Shae.V1, which shows a clear bias toward shorter genes (Fig. 6). Furthermore, it is plausible that the size of the gene set and the average gene length for S. mansoni are higher than for Shae.V2, because additional RNA-Seq data available for S. mansoni (e.g., for the cercarial stage) provided evidence for minimally or selectively expressed transcripts, thus facilitating the detection of novel gene models [26,27]. In the future, additional RNA-Seq data from multiple developmental stages (including miracidia, sporocysts and cercariae) for which data are currently unavailable, as well as long-read RNA-Seq data (cf. [89]), should assist in the curation of gene models and the discovery of new transcripts for S. haematobium. Another possible reason for a smaller inferred gene set might relate to the gene transfer approach employed here [48, 49], which did not include de novo prediction of genes in regions that previously did not have gene annotations.

In addition to the observed differences between the two most complete schistosome gene sets (S. mansoni and now S. haematobium), we also detected a number of differences in the associated genome assemblies (Fig. 3). For instance, S. haematobium scaffolds that contained gaps (e.g., scaffolds 1, 134, 153 and 257) tended to align to multiple (n = 2-6) distinct S. mansoni chromosomes, suggesting mis-assemblies. Similarly, there were scaffolds without gaps in the S. haematobium assembly (e.g., scaffolds 109, 142 and 149) which corresponded to multiple regions in distinct S. mansoni chromosomes that contained gaps, suggesting some incorrect scaffolding in the S. mansoni assembly. However, in both cases, it is possible that such regions do differ between the two species and are indeed the result of genome rearrangements. Whether these discrepancies represent mis-assemblies or stem from genomic rearrangement events could be the subject of comparative investigations using additional long-read sequencing in the future.
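One way to screen for scaffolds that align to several distinct chromosomes, as discussed above, is sketched below. It assumes that whole-genome alignments have already been reduced to (scaffold, chromosome, aligned bases) records. The record layout, the function names and the 10-kb threshold are illustrative assumptions and not the procedure used in this study.

```python
from collections import defaultdict


def chromosomes_per_scaffold(alignments, min_aligned_bp=10_000):
    """Sum aligned bases per (scaffold, chromosome) pair and keep substantial pairs.

    `alignments` is an iterable of (scaffold, chromosome, aligned_bp) tuples.
    `min_aligned_bp` is an arbitrary illustrative cut-off that discards
    short, possibly spurious or repeat-driven hits.
    """
    totals = defaultdict(int)
    for scaffold, chromosome, aligned_bp in alignments:
        totals[(scaffold, chromosome)] += aligned_bp
    per_scaffold = defaultdict(set)
    for (scaffold, chromosome), bp in totals.items():
        if bp >= min_aligned_bp:
            per_scaffold[scaffold].add(chromosome)
    return {s: sorted(c) for s, c in per_scaffold.items()}


def multi_chromosome_scaffolds(alignments, min_aligned_bp=10_000):
    """Return scaffolds whose substantial alignments hit >= 2 chromosomes."""
    hits = chromosomes_per_scaffold(alignments, min_aligned_bp)
    return {s: c for s, c in hits.items() if len(c) >= 2}


# Toy records with hypothetical scaffold and chromosome names.
toy = [
    ("scaffold_A", "chr1", 250_000),
    ("scaffold_A", "chr3", 120_000),
    ("scaffold_A", "chr5", 2_000),   # below threshold, ignored
    ("scaffold_B", "chr2", 400_000),
    ("scaffold_B", "chr2", 50_000),
]
flagged = multi_chromosome_scaffolds(toy)
print(flagged)
```

A scaffold flagged in this way is only a candidate: as noted in the text, the same pattern could reflect a mis-assembly in either genome or a genuine rearrangement between the two species.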

The goal here was to provide a high-quality genomic resource for S. haematobium, which will enable in-depth gene (re-)annotation employing short- and long-read RNA-Seq data and, more broadly, serve as a reference for functional and population genomics investigations of schistosomes. Overall, despite some differences in gene numbers and scaffold synteny, the BUSCO analysis presented here demonstrated and confirmed a step-change improvement in contiguity for the S. haematobium genome assembly and for the gene set, compared with the first draft (Shae.V1). Also, it provided evidence for an assembly quality that is comparable to the best available genome for S. mansoni [27]. Achieving a chromosome-contiguous assembly is the ultimate goal, which will provide substantial benefits to the research community, and should underpin systems biological investigations and the discovery of new disease interventions.
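Assembly contiguity is commonly summarised by the N50 statistic. The sketch below shows how N50 is computed from a list of scaffold lengths; the choice of N50 as the metric is an assumption made here for illustration, and the lengths are toy values, not data from Shae.V1 or Shae.V2.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Two toy assemblies of equal total size (100 units each).
fragmented = [10] * 10
contiguous = [60, 25, 10, 5]
print(n50(fragmented), n50(contiguous))
```

For equal total size, the assembly made of fewer, longer scaffolds yields the larger N50, which is why the statistic is used as a proxy for contiguity.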

Availability of supporting data
The genome assembly and gene set are available from NCBI [...]

Following the pre-submission enquiry by Professor Gasser, we were delighted to learn that Dr Scott Edmunds (Executive Editor) was supportive of us submitting the manuscript entitled "High-quality Schistosoma haematobium genome achieved by single-molecule and long-range sequencing" (by Andreas Stroehlein et al.) for publication as a Data Note in GigaScience, provided that we include: (i) a phylogeny of relevant flatworms, including S. haematobium, whose genomes are publicly accessible; and (ii) a picture of S. haematobium.
We have now addressed this request and further enhanced the manuscript following Dr Edmunds' email (9 April 2019).
In this manuscript, we report a substantially enhanced genomic resource for S. haematobium, a carcinogenic flatworm that causes a neglected tropical disease chronically affecting > 100 million people worldwide. At the molecular level, little is known about the biology of this blood fluke and the pathogenesis of the disease that this parasitic worm causes. To support molecular studies of this worm, we systematically enhanced the draft genome of S. haematobium using a single-molecule and long-range DNA sequencing approach. We have achieved a major improvement in the accuracy and contiguity of the genome assembly, making it superior or comparable to the best-quality assemblies available for a small number of related schistosome species. Using improved gene annotation pipelines, we inferred a gene set with as many or more complete gene models compared with those of the other well-studied schistosomes.
As you well know, the quality of a genome assembly has a substantial impact on subsequent analyses, in particular gene annotation and the calling of single nucleotide polymorphisms (SNPs). In this context, the present, improved genomic resource will clearly accelerate systems biological research of S. haematobium and related schistosomes, by enabling in-depth gene (re-)annotation and by serving as a solid reference for functional and population genomic investigations. Ultimately, progress in these areas will underpin the search for new disease interventions. We believe strongly that our manuscript fits the scope of GigaScience and that the present data set and findings will be a highly significant resource for the research community working on schistosomes and a wide range of other flatworms. We hope that you are as excited as we are about this contribution. We thank you in advance for considering and handling our manuscript. We very much look forward to the reviewers' reports. All authors have read and approved the R0-version of this manuscript. No part of this manuscript is under consideration, or has been submitted or published elsewhere, and none of the authors has any conflict of interest.