Genome Assembly of the Polyclad Flatworm Prostheceraeus crozieri

Abstract Polyclad flatworms are widely thought to be one of the least derived of the flatworm classes and, as such, are well placed to investigate evolutionary and developmental features such as spiral cleavage and larval diversification lost in other platyhelminths. Prostheceraeus crozieri, (formerly Maritigrella crozieri), is an emerging model polyclad flatworm that already has some useful transcriptome data but, to date, no sequenced genome. We have used high molecular weight DNA extraction and long-read PacBio sequencing to assemble the highly repetitive (67.9%) P. crozieri genome (2.07 Gb). We have annotated 43,325 genes, with 89.7% BUSCO completeness. Perhaps reflecting its large genome, introns were considerably larger than other free-living flatworms, but evidence of abundant transposable elements suggests genome expansion has been principally via transposable elements activity. This genome resource will be of great use for future developmental and phylogenomic research.


Introduction
Platyhelminthes (flatworms) are a phylum of protostomes related to annelids, mollusks, and other Lophotrochozoa; they are a very diverse phylum represented by both free-living (turbellarian) and parasitic species (Martin-Duran et al. 2012;Egger et al. 2015). They have received particular attention due in part to their parasitism but also to the remarkable regenerative abilities of many species. Members of most flatworm classes are unusual amongst Lophotrochozoa in that they display divergent embryogenic processes (notably blastomeren anarchie) that have captured the interests of evolutionary and developmental biologists (Martin-Duran et al. 2012;Egger et al. 2015). The canonical spiral cleavage, typical of many lophotrochozoan phyla, is only seen in the early diverging flatworm classes-Catenulida, Macrostomida, Lecithoepitheliata, and Polycladida. Ciliated larvae, comparable to those of annelids and mollusks, are even more restricted, being found only in the polyclads. The polyclad class is thus pivotal to understanding the starting point for the evolution of the divergent developmental modes in other platyhelminth classes and more generally for linking platyhelminth development to the wider context of the Lophotrochozoa (Egger et al. 2015).
Prostheceraeus crozieri (previously Maritigrella crozieri) is a species of polyclad flatworm found in the mangroves of Bermuda and the Florida Keys. The adults live on (and eat) colonies of the sea squirt species Ecteinascidia turbinata (Lapraz et al. 2013). Prostheceraeus crozieri is becoming a useful laboratory model polyclad and transcriptomes of different developmental stages exist; the species has been used to examine early spiral cleavage and larval development using micro-injection labeling techniques, 3D light sheet microscopy (Girstmair and Telford 2019), and gene expression in its Müller's larva using anti-body and in situ hybridization techniques (Rawlinson et al. 2019).
While previous work has resulted in an assembled de novo transcriptome (Lapraz et al. 2013), a genome is needed to enable comparisons with existing genomes of other free-living flatworms such as the laboratory models Schmidtea mediterranea (Grohme et al. 2018),

GBE
Macrostomum lignano (Wasik et al. 2015;Wudarski et al. 2017), and Dugesia japonica ) as well as those of the many parasitic species. Flatworm genomes are notoriously repetitive and challenging to assemble, but long-read sequencing has been used to improve assembly contiguity (Wudarski et al. 2017;Grohme et al. 2018).
We have used high molecular weight DNA extracted from a single individual and sequenced with PacBio technology to assemble a draft genome. The genome assembly and annotation will be a key resource for future studies involving this polyclad flatworm.

Results and Discussion
The Large Genome of P. crozieri High molecular weight DNA was extracted from a single, hermaphrodite P. crozieri adult and sequenced using PacBio and Illumina technologies,generating 11,921,195 PacBio reads with an N50 of ∼30 kb and 558,509,539 Illumina 150 bp paired-end reads, which FastQC identified high-quality reads throughout.
The initial assembly used Flye (Kolmogorov et al. 2019) to assemble PacBio reads to 2.26 Gb, with 26,131 scaffolds and an N50 of 261,667 bp. Polishing and purging of possible haplotype-associated duplicate scaffolds generally removed smaller scaffolds ( fig. 1A), reducing the final genome size to 2.07 Gb, with 17,074 scaffolds (16,926 scaffolds >1,000 bp) and increased the N50 to 292,050. The assembled genome has a GC content of 37.64% (table 1).
This assembled genome is larger than any other free-living flatworm genome known   (Wudarski et al. 2017;An et al. 2018;Grohme et al. 2018). The assembled genome size corresponds closely to a flow cytometry-based estimates of 2.5 Gb, indicating a ∼83% complete assembly (Lapraz et al. 2013). Kmer-based genome size estimates gave a smaller size of only 1.56-1.68 Gb genome size (supplementary table S1, Supplementary Material online), suggesting that Flye performed well despite issues with repeats presumably disrupting kmer-based size estimation. Kmer frequencies suggested diploidy, with two peaks occurring (fig. 1B) and predicted heterozygosity levels between 0.810% and 0.936% (supplementary table S1, Supplementary Material online).
The level of duplicate BUSCO genes in the initial assembly was 5.5% and, after polishing and haplotype purging, this was reduced to 2.7% (supplementary table S2, Supplementary Material online). In both assembly versions, the percentage of missing BUSCO genes was similar, at ∼13.5% (supplementary table S2, Supplementary Material online), indicating that haplotype-specific scaffold removal did not reduce genome completeness.

Highly Repetitive Genome
A total of 67.9% of the P. crozieri genome was identified as repeat, and this portion was masked. This level of repeats was high, but was anticipated given other highly repetitive flatworm genomes (e.g. S. mediterranea and D. japonica genomes have 61.7% and 80% repeat content, respectively) (Wasik et al. 2015;Wudarski et al. 2017;An et al. 2018;Grohme et al. 2018) and the predicted size of this genome. The percentage of repeat content was greater than S. mediterranea (61.7%), but less than the estimated 80% in D. japonica. While retroelements (10.19%) and DNA transposons (23.89%) like PiggyBac and hobo-activator, and SINE (Penelope) and LTR (Pao and Copia), and 1.62%

Significance
Flatworms are a major phylum of protostome animals showing enormous diversity, from free-living "turbellarians" to parasites including tapeworms, liver flukes, and schistosomes. Flatworm body plans and embryology have diverged considerably from the state seen in other protostomes, with many classes showing a unique form of early cleavage called "blastomeren anarchie". Only a few platyhelminth classes, including polyclads, have retained a canonical spiralian type of development and polyclads are the only flatworm class with both spiral cleavage and ciliated larvae comparable to an annelid or mollusk trochophore larva. While whole-genome sequences are available from several other classes of flatworm, we have sequenced the first genome of a polyclad. Our annotated genome will provide an essential resource for the further study of this developing laboratory model and will help us understand the evolution of flatworm genomes, embryology and body plans and allow us to make fruitful comparisons across the animal kingdom. GBE of other repeats (e.g. small RNA, satellites, rolling circles, simple repeats), were identified in the genome, the largest fraction of repeats was unclassified (32.3%).
There were many large repeat regions >10 kb, but small repeats were also abundant ( fig. 1C). Sequencing and assembly of other free-living flatworms has proved difficult due to the log ( Genome Biol. Evol. 14(9) https://doi.org/10.1093/gbe/evac133 Advance Access publication 30 August 2022 highly repetitive genomes and long repeats, and we also encountered assembly difficulties here, despite using PacBio long reads, likely due to high repeat content and long repeats.

Many Gene Annotations Have Large Introns
Braker2 (Bruna et al. 2021) was used to predict gene models and predicted a total of 43,325 genes, with 46,235 isoforms, which had an average length of 2,048 bp. The 23,852 of the 43,325 genes had transcriptional support >1 transcript per million in the RNAseq data.
InterProScan (Jones et al. 2014) identified 21,493 of the predicted genes with homology to Pfam domains and, of these, 12,199 were also supported by the existing transcriptome data. This suggests that Braker2 was able to recover gene predictions that had Pfam homology but which lacked RNAseq evidence. The BUSCO completeness of the annotated gene set (C:89.7% [S:87.1%, D:2.6%], F:5.2%, M:5.1%) was more complete than the genome assembly alone (supplementary table S2, Supplementary Material online).
We compared the length and GC content of exons and introns with other free-living flatworms (Zhu et al. 2009 1E). P. crozieri average exon GC content was 44.5% (higher than the genome GC of 37.64%), which was greater than S. mediterranea and D. japonica, but less than M. lignano ( fig. 1D). The GC of introns (37.4%) was very similar to the background P. crozieri genomic GC content ( fig. 1E).
Across all four species, a total of 5,428 Pfams were detected, with 3,233 being shared in all four species ( fig.  1G). We also asked how many genes were associated with each Pfam domain in the other available free-living flatworm genomes. The number of genes per Pfam domain was similar in P. crozieri, S. mediterranea, and D. japonica, but the macrostomid M. lignano had more instances of genes linked to each Pfam, supporting previous evidence of high levels of duplication in M. lignano ( fig. 1H) (Wasik et al. 2015;Wudarski et al. 2017). It is possible that the large number of specific orthology groups in M. lignano is associated with the divergence of these duplicated genes (Holland et al. 2017;Natsidis et al. 2021).
Many of the most frequently occurring Pfam domains in P. crozieri (rvt_1 [pf00078], rve [pf00665], piggybac [pf13843], and integrase [pf17921]), were also more abundant than the other flatworms ( fig. 1) and are associated with retroviral or transposable element genes. Taken together with the high proportion of repetitive elements, this could suggest that P. crozieri has a large number of active transposable elements. It is unclear whether the large intron sizes (when compared with other flatworms) are functionally related to the higher transposable element activity.
ParaHox genes (Cdx, Gsx, and Xlox/Pdx) have been lost (or not identified) in S. mediterranea (Currie et al. 2016); we identified Cdx and Gsx but not Xlox/Pdx in P. crozieri (supplementary table S3, Supplementary Material online). The Hox genes were not found in a single cluster, although two Hox9-13 genes were linked on a single scaffold, Cdx and Hhex were present on another scaffold and tandem duplicates of Otx on a third (supplementary table S3, Supplementary Material online). Low discovery of syntenic homeobox genes may be a result of a large, repeat-rich genome that is fragmented. The P. crozieri genome is considerably larger than other flatworms sequenced to date. However, given the complete repertoire of homeobox classes and high BUSCO completeness, the lack of extensive duplications of either homeobox or BUSCO genes suggests that there have been no large-scale or pervasive gene duplications in the lineage leading to P. crozieri.
Flatworms have lost most mammalian stem cell and pluripotency genes (Oct4/Pou5f1, Nanog, Klf4, c-Myc, and Sox2) however. Of these mammalian factors, only Sox2 homologs remain in S. mediterranea and M. lignano (Wasik et al. 2015;Grohme et al. 2018). Similarly, in P. crozieri, Sox2 was present in one copy, and none of the other factors were identified, despite its regenerative capabilities. Therefore P. crozieri like other flatworms, lacks the pluripotency genes commonly found in mammals, though further improvements in P. crozieri genome and annotation completeness may help to validate this observation.

Conclusion
We have assembled and annotated the first polyclad flatworm genome of P. crozieri attaining a 2.07 Gb assembly with 43,325 genes. The high repeat content of 67.9% was not unexpected based on other flatworm genomes. Despite the problems that these high repeat contents can cause in genome assembly, the high BUSCO scores we observed and the large homeobox repertoire suggest the assembly and annotation are of reasonable completeness and of a quality that will be useful for future studies. Our work helps elevate P. crozieri as an increasingly important model that will contribute to our understanding of flatworm and animal evolution.

Materials and Methods
Animal collection, DNA extraction, and sequencing P. crozieri adults were collected between Largo and Marathon Keys in the Florida Keys, USA (September/ October 2019), transported in sea water to UCL, UK, and transitioned to artificial sea water (ASW) and maintained in ASW for 4 weeks. DNA from one live adult was extracted following a standard soft tissue protocol from BioNano Prep Animal tissue DNA Isolation. Extracted DNA was stored at 4°C for 3 days before DNA concentration was estimated using NanoDrop and TapeStation technology. Approximately 10 μg of DNA was used for library preparation and sequencing with two SMRT SQII PacBio cells and shearing, library preparation, and 150 bp paired-end Illumina sequencing done at the University of California, Berkeley, CA, USA.

Genome Assembly
We use the repeat concatenated de Bruijn graph assembler Flye v2.7 (Kolmogorov et al. 2019) and the PacBio reads for an initial assembly with the genome size parameter set to 2.5 Gb (-g 2.5 g), 75× coverage for repeat graph construction (-asm-coverage 75) and a minimum overlap of 8,000 bp (-m 8000) to avoid an overly fragmented assembly. This was followed by one round of polishing with long reads using Flye (Kolmogorov et al. 2019 (Bolger et al. 2014) and long reads to polish using the -task = best strategy. The parameters for mini-map2 v2.17-r941 (Li 2018) for max depth of short reads was set to 35× coverage and for long reads -x map-pb, with a minimum read length of 5 kb, maximum read length 300 kb, and max depth at 60×.
Purge_dups v1.2.3 (Guan et al. 2020) further collapsed haplotype scaffolds (including parameter -e). We searched for BUSCO genes at each step of assembly and the final gene predictions. Busco v3.0.2 (Simao et al. 2015) was used with metazoan_odb9 with default evalue and "-long" for optimization of the Augustus parameters in genome searches.