Mutational profile of the regenerative process and de novo genome assembly of the planarian Schmidtea polychroa

Abstract Planarians are organisms with a unique capacity to regenerate any part of their body. New tissues are generated in a process that requires many swift cell divisions. How costly is this process to an animal in terms of mutational load remains unknown. Using whole genome sequencing, we defined the mutational profile of the process of regeneration in the planarian species Schmidtea polychroa. We assembled de novo the genome of S. polychroa and analyzed mutations in animals that have undergone regeneration. We observed a threefold increase in the number of mutations and an altered mutational spectrum. High allele frequencies of subclonal mutations in regenerated animals suggested that most of the cells in the regenerated animal were descendants of a small number of stem cells with high expansion potential. We provide, for the first time, the draft genome assembly of S. polychroa, an estimation of the germline mutation rate for a planarian species and the mutational spectrum of the regeneration process of a living organism.


Introduction
Living organisms differ significantly in their capacity to regenerate damaged or lost tissues.Some organisms, like planarians, are capable of rebuilding a complete body from a small number of cells, while others, like Caenorhabditis elegans cannot replace damaged tissues at all.In mammals, regeneration is restricted, with the liver being particularly efficient in eliciting tissue repair response, but also other tissues have a limited potential to regenerate ( 1 ).The process of regeneration requires a complex orchestration of basic biological processes: proliferation, differentiation, and cell death, in a very similar way as during embryogenesis or in pathological conditions like carcinogenesis.A physiological process like embryogenesis is not cost free at the level of the genome of proliferating cells.Mutation rates of fetal tissues can exceed fivefold the mutation rates of the same tissues over a year period in an adult ( 2 ).In fact, mutations acquired during embryogenesis are a risk factor for development of cancer later in life ( 3 ).
Easier accessibility to whole genome sequencing has made it possible to study mutational spectra of physiological and pathological processes in great detail over the last decade .Human de novo germline mutation rates and their spectra have been determined ( 4 ), as well as somatic mutation spectra of specific tissues across different mammalian species ( 5 ) or somatic mutations in collections of cancer tissues ( 6 ).All these processes leave mutational 'scars' at the genome level, with smaller or greater consequences to the functioning of an organism.In a lifetime of an individual, tissues will be exposed multiple times to mechanical, chemical or in some cases surgical insults, that elicit tissue repair response, but the potential mutational burden of this process has not been assessed yet.To answer this question, we chose the most efficient and accessible model system for regeneration studies -the planarian.Regeneration in planarians is an extremely fast and accurate process which can produce a new individual from a small piece of a tissue, amputated from any body part.Abundance of pluripotent somatic stem cells (traditionally referred to as neoblasts) and a complex system of positional information underlie this process, which gained a solid mechanistic framework in the last two decades (7)(8)(9).It takes only about seven days for a fragment of an animal to regenerate missing tissues, increasing numbers of proliferating stem-cells fivefold within 48 hours ( 10 ).
We have previously published a pipeline based on an algorithm -IsoMut -for the analysis of unique mutations in isogenic whole-genome datasets ( 11 ).IsoMut was used successfully to elucidate the mutational signature of cisplatin, and the mutational profiles of several other chemotherapeutics or genetic treatments, in the context of oncological studies (12)(13)(14)(15).Here, for the first time, we applied it to a whole organism-the planarian Schmidtea polychroa , to study the mutation rate of a physiological process, regeneration.S. polychroa is surprisingly easy to maintain and propagate in the lab, but its main advantage for the scope of this study was the ability to reproduce by parthenogenesis-which satisfies the requirement for the isogenic background.S. polychroa is a cross-fertilizing, pseudogamous hermaphrodite, meaning that the sperm of one animal activates embryogenesis of the egg of another animal, but the genetic material of the first animal is excluded from the zygote ( 16 ).Oocytes of these animals pass through a premeiotic doubling of the triploid genome, followed by an asynaptic meiosis without effective recombination of the genetic material ( 17 ).Though S. polychroa reproduces mainly by parthenogenesis, occasional sex has been observed in this species at very low frequency ( 18 ).S. polychroa is not a new model in regeneration studies.Several versatile and interesting biological questions have been addressed in this system.Mechanisms of planarian embryogenesis, unique among metazoans and characterized by a divergent gastrulation stage with temporary feeding structures, were studied in S. polychroa ( 19 ).Another notable aspect of the physiology of S. polychroa is the absence of aging, as measured by mass-specific metabolic rates ( 20 ).Evolutionary studies have addressed the intricate question of successful survival of a parthenogenetic species, which is rescued by rare, occasional sexual reproduction ( 18 ).
Though S. polychroa shows a number of advantages as a model system, the main drawback remains the lack of a published reference genome.To bridge this gap and to answer whether regeneration is associated with a genome-wide increase in DNA mutations, we set out to sequence and assemble the genome of S. polychroa .Using protocols for isolation of high-quality planarian genomic DNA, PacBio longread sequencing, followed by de novo genome assembly, we provide the first draft genome of S. polychroa .We describe the structure of the genome and validate it through comparison with the well characterized species Schmidtea mediterr anea ( S. mediterr anea ) .Finally, whole genome sequencing of parent-regenerant-parthenogenetic offspring trios allowed us to estimate the de novo mutation rate of S. polychroa , calculate the mutational burden and spectrum associated with the regeneration process, and show that stem cells contribute differentially to regeneration.

Animal husbandry
S. polychroa was a kind gift of the Aboobaker lab and the asexual biotype of S. mediterranea was a kind gift of the Rink lab.Planarians were grown in 0.05% Instant Ocean at 20 • C and fed organic calf liver once per week.Animals were starved two weeks at least before genomic DNA extractions or one week at least before other experiments.

Genomic DNA extraction and sequencing
For long-read PacBio sequencing we used the previously published protocol for planarian genomic DNA extractions ( 21 ).For Illumina sequencing of single animals, we introduced a slight modification in the extraction protocol.Briefly, animals were treated with 0.5% N -acetyl-l -cysteine (NAC) pH 7.5 in agitation to strip off the mucus.NAC was replaced with a lysis buffer: 4 M guanidinium thiocyanate, 25 mM sodium citrate, 0.5% (w / v) N-lauroylsarcosine, 7% v / v βmercaptoethanol and animals were swiftly disrupted with a motorized pestle for several seconds.This allowed for higher yield of genomic DNA without shearing.Animals of average size: 1.5-1.8cm, were lysed in 1 ml of lysis buffer.The protocol further follows that of Grohme et al., including the post-purification step of mucopolysaccharide removal with cetyltrimethylammonium bromide ( 21 )

Genome size estimation by FACS
In total five samples containing six animals, each were collected for genome size estimation.Schmidtea mediterranea was used as a reference sample of known genome size.Frozen samples were homogenized briefly in Galbraith buffer (45 mM MgCl 2 , 30 mM Sodium citrate, 20 mM 3-( N -morpholino) propanesulphonic acid (MOPS), 0.1% v / v Triton X-100) with a motorized pestle.Homogenized samples were passed sequentially through a 70 μm and then 40 μm mesh.Nuclei were pelleted by centrifugation, resuspended in 1 × PBS with 50 μg / ml propidium iodide, 200 μg RNAse A, and incubated for 2 h on ice, prior to data acquisition.Flow cytometry data were acquired on BD FACSVantage SE (Becton Dickinson) and analyzed using FlowJo software Macintosh version 8.1.1 (Tree Star).

RNA extraction and sequencing
Total RNA was extracted as described previously ( 22 ), with a modified post-purification step.Namely, after Trizol extraction and DNAse treatment, RNA was ethanol precipitated to avoid loss of short transcripts.4 animals of different size were used for extraction of total RNA.RNA quality was estimated by NanoVue.Strand-specific library preparation and sequencing (100 bp, paired-end) was done by BGI (Hong Kong, China) using DNBSeq.

De no v o genome assembly
The draft genome was constructed from PacBio long reads using Canu v1.8 ( 23 ) with parameters optimized for relatively low coverage ( correctedErrorRate = 0.105 , corMin-Coverage = 0 , corMhapSensitivity = 'high', corOutCoverage = 10 000 , corMaxEvidenceErate = 0.15 ).Three rounds of polishing were performed by Arrow 2.3.3 ( https://github.com/ PacificBiosciences/ pbbioconda ) followed by one polishing round with Pilon 1.23 ( 24 ), both with default parameters.The Illumina dataset used for Pilon polishing was error corrected by Quake 0.3 ( 25 ) with a k -mer value of 19.Redundant contigs were removed using a multi-step purging strategy.First we ran the purge_dups pipeline ( 26 ), utilizing Illumina reads aligned with minimap2 ( 27 ) and using coverage cutoff values of 5, 40 and 70.Next, further haplotig reduction and first-level scaffolding using Illumina and PacBio reads was achieved by running all three steps of redundans.py( 28 ).The remaining haplotigs were removed using a custom script.Briefly, first we performed an all-to-all mapping using minimap2 with parameters -x asm5 -D -n 18 -secondary = no and filtered the resulting alignments that were not one-to-one mappings, had a secondary chain score larger than half of the primary chain score, had > 60% divergence in the mapped region, had mapping qualities < 60 or that were overlapping with one single repeat element on > 40% of the aligned length.Next, we removed the filtered hits (which are present on two different scaffolds) from the shorter containing scaffold.Final scaffolding was done using the PCFL de novo transcriptome assembly with L_RNA_Scaffolder ( 29 ).As the last step, we removed all scaffolds shorter than 1 kb, having a GC ratio over 40% or an average Illumina coverage < 10.All scaffolding rounds were followed by gap closing with GapCloser 1.2 ( 30 ), resulting in the final draft genome referred to as bm_Spol_g1.

Transcriptome assembly
RNA-sequencing reads were assembled with Trinity ( 31 ) using default parameters.The raw de novo transcriptome assembly was annotated by the Trinotate ( 32 ) pipeline.The proba-ble protein coding fraction of the transcriptome assembly was generated by filtering the raw transcripts for the presence of BLASTX hits against the Uniprot / Swissprot database, or the presence of at least one Pfam domain, or a minimal length of 100 bp.Gene annotations were determined using the raw transcriptome assembly by GMAP ( 33 ).For some analyses, we further filtered the transcriptome dataset to obtain the longest isoforms of each gene (PCFL set).

Genome characterization
Genome size and completeness was estimated using four methods: k -mer based estimations were obtained using GenomeScope2 ( 34 ); the coverage-based calculation used the Lander-Waterman equation ( G = LN / C , where G is genome size, L is read length, N is the number of reads and C is the genome-wise average coverage); we also utilized BUSCO ( 35 ) with the eukaryota_odb10 and the metazoa_odb10 datasets.Cellular ploidy was determined by observing the allele frequencies of germline mutations present in all analysed animals.Repetitive element families were predicted by Repeat-Modeler 2.0.1, and the resulting repeat families were detected in the draft assembly by RepeatMasker 4.0.9(Smith, A., Hubley, R. and Green, P. (2013-2015)) both in the case of the bm_Spol_g1 assembly and the D. japonica assembly ( 36 ) downloaded from NCBI BioPoject PRJNA580305.Genes were annotated by remapping the transcripts in the PCFL set onto the final genome assembly by GMAP ( 33 ).Multi-mapper, duplicate and chimeric transcripts were manually curated by their overlaps with repeat elements, by the presence of spliced-leader trans-splicing leader sequences as determines by the SLIDR / SLOPPR pipeline ( 37 ), and the lengths and relative positions of the multiple mapping sites.The scaffold containing the mitochondrial genome was manually curated, and mitochondrial genes were annotated with the MITOS webserver ( 38 ).Large-scale genome alignments were performed by lastz ( 39 ) using the parameters -step = 20 ,nogapped and -notransition .Synteny blocks between S. polychroa and S. mediterranea were determined by mapping the PCFL transcript sets of both species onto their genome assemblies (bm_Spol_g1 and dd_Smes_g4, respectively) by GMAP, and extracting homology information from the PlanMine API ( 40 ).To ensure that only clear homologies were used, we only selected orthologous pairs with one-to-one homology, and only considered those S. polychroa transcripts that were masked on < 10% of their length, had a mapped length of more than 300 bp, and were not inside another gene.The dd_Smes_g4 scaffolds were mapped using minimap2 onto the chromosome-level assembly of S. mediterranea ( 41 ).

Mutation calling
Illumina sequencing datasets from the parental, regenerant and filia animals were aligned against the bm_Spol_g1 draft reference genome with BWA mem ( 42 ).Lineage-specific mutations (i.e.pre-existing events shared between a parent animal and its progenies) were found by the HaplotyeCaller tool from the GATK 3.8 package ( 43 ).De novo mutations in filia were found with IsoMut ( 11 ) and HaplotypeCaller.Haplo-typeCaller hits were postfiltered in each case by extracting the number of supporting reads in each potential heterozygous site using the pysam Python library, and only those events were retained that had > 7 supporting reads in the candidate samples, but no more than 1 read in any other, having a mean coverage of 25-130 × in the assessed samples, and had an average mapping quality over 40.IsoMut was initially ran with permissive settings ( min_sample_freq = 0.15 , min_other_ref_freq = 0.9 , cov_limit = 20 ), and potential hits were filtered to have at least seven supporting reads and coverage between 25 × and 130 ×.All candidate mutations were manually verified in the Integrative Genome Browser 2.12 ( 44 ).

Validation of mutations
Subclonal mutations detected in regenerant animals were validated by PCR.Only those candidate events were considered that were positioned in uniquely mappable ±250 bp regions as assessed by BLASTn.Primer pairs were designed to amplify regions surrounding each of the unique mappable events ( Supplementary Table S10 ) by PCR, and the resulting amplicons were mixed and sequenced on an Illumina MiSeq instrument (Cogentech, Milan, Italy).The resulting 2 × 150 bp paired-end reads were trimmed with cutadapt ( 45 ) to remove adapters, merged with FLASH2 ( 46 ) using parameters -M 150 and -x 0.05 , and aligned against the bm_Spol_g1 assembly with BWA mem.Allele frequencies of mutations were determined manually by inspecting the mutation sites in the Integrative Genome Browser.

De no v o assembly of a no vel planarian genome
Adult S. polychroa animals range 1-2 cm in size.Their epidermis is of dark brown pigmentation and adult animals develop multiple photoreceptors (Figure 1 A).
To obtain high-quality genomic DNA, required for longread PacBio sequencing, we isolated genomic DNA from 20 animals according to a previously published protocol for the extraction of genomic DNA from S. mediterranea ( 21 ).For the assembly of the draft genome, we generated 18.04 Gb of PacBio long reads with an average read length of 15.24 kb and 39.42 Gb of 150 bp paired-end Illumina reads.As planarian genomes are generally GC-poor, we assembled the long reads with Canu using parameters optimized for lower coverages and skewed GC-ratios.After polishing with Illumina and PacBio reads, scaffolding and haplotig purging (Figure 1 B, Supplementary Figure S1 ), our draft genome, referred below as bm_Spol_g1, consisted of 8253 scaffolds with an overall assembly length of 434.3 Mb and an N50 value of 80.7 kb.
Wild populations of S. polychroa represent at least four coexisting biotypes ( 16 ) with four chromosomes, different ploidy levels and experimentally determined haploid genome sizes in the range of 0.56-1.29 Gb (Gregory, T.R. ( 2023).Animal Genome Size Database.http://www.genomesize.com.) To characterize the biotype of our animals, we estimated the true genome size using several approaches (Figure 2 A, Supplementary Table S1 ).The average haploid genome size was 587.9 Mb, indicating that our 434.3Mb draft assembly was ∼74% complete.In addition, we measured genome size experimentally by flow cytometry ( 47 ), comparing the fluorescent signal of propidium-iodide stained nuclei of S. polychroa to that of a species of a known genome size, S. mediterranea .588.67 Mb obtained by flow cytometry correlated very well with the theoretical value of 587.9 Mb (Figure 2 B, Supplementary Figure S2 , Supplementary Table S2 ).To confirm the ploidy level of the biotype used in the study, and to exclude the possibility of paternal contribution to the genetic material through occasional sex, we performed an analysis of allele frequency distribution of high-confidence germline mutations.It has been reported that presence of tetraploid offspring can be a good estimator of occasional sex rate in S. polychroa ( 48 ).The distribution of allele frequencies of germline mutations showed peaks at 33% and 66% (Figure 2 C), indicative of a triploid genome, which was also supported by the presence of three peaks on the KAT profile of the reference genome ( Supplementary Figure S1 B).This was confirmed for all animals used in this study ( Supplementary Figure S3 ), indicating no paternal contribution and a stable triploidy in all of the animals.
The draft genome of S. polychroa contained 57.98% of repetitive elements ( Supplementary Table S3 ).Indeed, planarian genomes are known to have high levels of repetitive DNA content ( 21 ).The distribution of different genomic elements showed a profile differing from vertebrate genomes and had an especially high level of DNA elements and unclassified repeats, similar to the repeat content of S. mediterranea ( 21 ) and of Dugesia japonica ( 36 ) (Figure 2 D).The high prevalence of repetitive elements contributed to the fragmentation of the draft genome: elements longer than 2 kb were overrepresented at the ends of scaffolds (Figure 2 E).
Next, we assembled the mitochondrial genome of S. polychroa ( Supplementary Figure S4 ).We detected non-canonical start codons, with TTG being used in 6 out of 13 mitochondrial protein coding ORFs.The gene order in the mitochondrial genome of S. polychroa is highly conserved with respect to the other members of triclads.The genome is interspersed with long non-coding regions that increase the length of the mitogenome to a surprising 25 534 bp.Similar features of mitochondrial genomes have been described in other Platyhelminthes ( 49 ).

The genomic architecture of S. polychroa
To have an independent measure of the completeness of our genome assembly, we generated a de novo transcriptome assembly using RNA isolated and sequenced from multiple S. polychroa individuals; we refer to this assembly as bm_Spol_tr1 ( Supplementary Table S6 ).We evaluated several quality control measures of the bm_Spol_tr1 transcriptome and found that the assembly contains the information of over 98% of the RNA-seq reads ( Supplementary Figure S5 B), and in terms of universally orthologous genes across species (BUSCO content) and number of genes / transcripts it is comparable to the already existing dd_Spol_v4 transcriptome assembly available on PlanMine ( 40 ) ( Supplementary Figure S5 C, D).To enrich for high-confidence transcripts, we selected the longest isoforms of the probable coding fraction (PCFL set) of the transcriptome, and these were annotated using the most similar transcripts in the dd_Spol_v4 transcriptome and the Uniprot / Swissprot database per BLAST, and the homologous S. mediterranea transcripts, which were taken from PlanMine.
Transcripts were back-mapped onto the bm_Spol_g1 genome assembly with GMAP ( 33 ) (Figure 3 A).For the backmapping, we used only the longest isoforms of the probable coding fraction (PCFL set) from the bm_Spol_tr1.Out of 25054 transcripts in the PCFL set, 17735 transcripts were mapping at a single locus, 4960 were not mapped by GMAP, in agreement with our previous estimate of > 80% completeness of the bm_Spol_g1 assembly, and the remaining transcripts were mapped as duplicates, at multiple positions or in two chimeric fragments; these transcripts were further analysed manually ( Supplementary Table S4 ).Almost all multi-mapper transcripts overlapped with RepeatModeler models, suggesting that these are mostly transcribed repeat elements.Importantly, only 37 out of the 25054 PCFL transcripts were remaining unpurged haplotigs on two different scaffolds, and upon visual inspection these were never longer than a few kbps.In accordance with the analysis of k-mer abundances ( Supplementary Figure S1 B), this proves that heterozygous regions were appropriately collapsed.Interestingly, among the chimeric transcripts we identified those caused by splicedleader trans-splicing (12.5%).Trans-splicing used a splicedleader sequence ( Supplementary Figure S6 ), encoded by six spliced-leader genes in several clusters as assessed by SLIDR ( 37 ), and both the sequence and the secondary structure of this non-coding RNA was similar to spliced-leaders identified in S. mediterranea ( 50 ).We also repeated the back-mapping analysis with a limited set of conserved transcripts from the PCFL set of bm_Spol_tr1, i.e. those that had a close BLASTn match among the transcripts in dd_Spol_v4, with these matches having homology in PlanMine to a S. mediterranea transcript ( Supplementary Figure S5 E, Supplementary Table S5 ).This restricted transcript list would enrich high-confidence genes, conserved in both Schmidtea species, thus enabling a less noisy characterisation of assembly quality.Indeed, this repeated GMAP analysis revealed significantly fewer missing, double-mapping and multi-mapping transcripts (73.7% versus 83.3% single mappers among the full and restricted PCFL sets, P = 3.15 × 10 −13 , χ 2 test), supporting our previous conclusions about the quality of the bm_Spol_g1 assembly.
Next, we tested the quality of the described genomic architecture of S. polychroa through a comparison with a reference genome of S. mediterranea .Interestingly, large-scale alignment between the two species with lastz ( 39 ) showed that sequence-based identities are restricted mainly to coding exons, so we decided to use gene order and positioning to find homologous regions in the two planarian species.Using GMAP, we mapped the PCF sets from the transcriptomes of both species (dd_Spol_v4 and SMEST.1, respectively) against the bm_Spol_g1 and the dd_Smes_g4 ( 21 ) assemblies, respectively, and used homology information from PlanMine ( 40 ) between the two Schmidtea species to predict synteny regions using only one-to-one homologies (Figure 3 C).Altogether, 86% of S. polychroa scaffolds (3823 / 4434) with identified S. mediterranea orthologous genes were linked only to one S. mediterranea scaffold, suggesting direct synteny between the regions ( Supplementary Figure S7 A of longer synteny regions was limited by the remaining fragmentation of the bm_Spol_g1 assembly.We also identified S. polychroa scaffolds with multiple links to S. mediterranea scaffolds ( Supplementary Figure S7 B).These events either mark potential rearrangement sites between the two species, or erroneous scaffolding.We manually evaluated break sites where at least two genes were on both sides mapping to different scaffolds in S. mediterranea ( Supplementary Figure S7 C, Supplementary Table S5 ).22 such rearrangement sites were real rearrangements between the two species with reciprocal translocations and large-scale inversions (Figure 3 D).75 break sites represented poor scaffolding.Interestingly, 49 potential break sites mapped to the very ends of two different S. mediterranea scaffolds.Comparison of these regions to a more recent chromosome-level assembly of S. mediterranea ( 41 ) showed that the sequences of these dd_Smes_g4 scaffolds are indeed positioned adjacently ( Supplementary Table S7 ), giving an example how the comparative genome analysis of two close relative species can contribute to more complete assemblies.

De no v o mutation rate in S. polychroa
De novo mutation rates show similarities across living organisms; however, de novo mutation rate of any planarian species has not been established yet.An effort to measure a number of lineage-specific mutations extracted from transcriptomes of two D. japonica strains has been made previously ( 51 ), without estimation of mutation rates.To determine de novo muta-tion rates in S. polychroa , we took advantage of the parthenogenetic reproduction strategy of the triploid biotypes of this species, as only the maternal germline will contribute to the genetic content of the offspring.This way, any heterozygous mutation detected in an offspring, but not found in its parent, will represent the mutagenic events that occurred in time, between the parental zygote and the zygote of the offspring (Figure 4 A, control group).This scenario is conceptually similar to the role of single cell cloning steps in induced mutation studies ( 12 ).
To find de novo mutations, we started from a population of four adult S. polychroa (parental generation P C ) and collected 4 of their hatchlings that developed into adulthood (filial generation Fc; Figure 4 A).Genomic DNA was isolated from both the parents and the filia .After whole genome sequencing, the resulting genomic DNA datasets were aligned against the bm_Spol_g1 draft genome, and point mutations and short indels were called in parallel by IsoMut ( 11 ) and Haplotype-Caller ( 43 ).The two methods are different, yet complementary.While IsoMut is a ploidy-agnostic mutation caller that only considers the number of supporting reads in a given sample, and efficiently filters out noise by concurrent analyses of many samples, HaplotypeCaller is a program often used on cancer samples with chaotic ploidy, that works by performing local re-assemblies around potential mutations.Mutation calling, by two different algorithms, allowed for more scrupulous detection of real mutations.
As the animals were kept in a shared vessel, it was unknown which offspring belonged to which parent.By analysing  B ack-mapped transcripts on S. poly chroa scaff olds scaff old_211 and scaffold_262 (orange ribbons) uncover a reciprocal translocation between to S. mediterranea scaffolds dd_Smes_g4_166 and dd_Smes_g4_33 .Some mapped genes on scaffold_262 are also linked to a third S. mediterranea scaffold, dd_Smes_g4_152 : here the exact breakpoint cannot be resolved, because rele v ant S. poly chroa transcripts in the region are not mapped on the bm_Spol_g1 assembly (purple ribbons).
shared, lineage-specific mutations between parents and the filial generation, we identified that the four parent animals (Pc1-Pc4) produced 1, 1, 0 and 2 offspring, respectively (Figure 4 C, Supplementary Table S8 ).The filia were named F C 1, F C 2, F C 4A and F C 4B, with the numbers referring to the respective parent, and the A / B notation marking siblings from the same parental animal.The filial generation contained on average 5 ± 2.94 (mean ± SD) unique clonal SNVs and 1.25 ± 1.5 unique clonal indels that were not detectable in the respective parents (Figure 5 A, Supplementary Table S9 ): these clonal mutations were not found in the parental zygote but were universally present in all cells of the offspring, and thus represent the total mutation load of a generation.By considering only mutations positioned in the uniquely mappable fraction of the assembly, having a length of 303.93 Mb, and a triploid genome, the de novo rates amount to 1.31 × 10 −8 per base pair per generation for SNVs, or 1.72 × 10 −8 per base pair per generation overall.To our knowledge this is the first report of de novo mutation rate for a planarian species.

Regeneration induces additional mutations in filial generation
During regeneration, germ cells arise from the same pool of stem cells as somatic cells ( 52 ), suggesting that patterns of mutagenic consequences of whole-body regeneration will also be captured as changes in the genome of the filial generation, in addition to de novo mutations accumulated during the parthenogenetic reproductive cycle.We expanded our previous experimental set up (Figure 4 A) and started from four different parental animals (P R 1-P R 4) that underwent an amputation step prior to parthenogenesis.Most of the animal body was amputated, including ovaries, and only a tip of the tail was left to regenerate into a new individual (Figure 4 B).When fragments completed regeneration and developed into adults (R1-R4), they were allowed to lay eggs.Four offspring animals were collected and together with the respective regenerated and parental animals, were used for DNA extraction and whole genome sequencing.Shared mutations in the 12 samples uncovered the relations between the animals (Figure 4 C) and assigned a regenerated animal (R) and a parent (P R ) to each of the offspring animals.We discovered that while F R 3 / R3 had no descendants among the filia , F R 4 / R4 had two offsprings, so the filial specimens were named F R 1, F R 2, F R 4A and F R 4B.Similarly to the F C animals, we found clonal heterozygous somatic mutations in the genomes of the F R animals with a typical allele frequency of 1 / 3 ( Supplementary Table S10 ).Very few of these affected coding regions ( Supplementary Table S9 ), altogether making it unlikely that they had any effect on the life of the animals.When compared with the P R parent genomes, we found that the filial generation harboured 13.5 ± 2.52 unique SNVs and 1.5 ± 0.58 unique indels.This implied that regeneration  caused a nearly threefold increase in the de novo SNV mutation rate in one generation as compared to parthenogenetic reproduction (Figure 5 A) ( P = 0.012, unpaired two-sided t -test).
The increase in the number of mutations can be ascribed solely to the regeneration process, as regeneration cycle is the only difference between the control and the regenerated group.Interestingly, the majority of mutations were shared in the siblings F R 4A and F R 4B.This suggested that these animals developed from germ cells in the R4 regenerant that were descendants of the same stem cell lineage, which probably underwent a high number of cell divisions during the regeneration process prior to germline specification.The same finding also confirmed that the detected clonal filial mutations were not generated during embryogenesis.
The point mutation spectra of the de novo mutations in control versus regenerated filial generation showed a marked difference between the two groups (Figure 5 B).SNVs in control animals had a broad spectrum, while in the regenerated filial generation an elevated number of C > T and T > C transitions could be observed.We compared the de novo spectra of control and regenerated animals to the spectrum of lineagespecific germline SNPs, those that were shared by all animals included in this study.We found that the profile of SNPs shared by all animals resembled more closely the de novo spectrum of the control group (Figure 5 B, C, Supplementary Figure S8 ) than that of the regenerated group, implying that regeneration had no contribution to the spectrum of germline mutations in studied S. polychroa population, in accordance with previous observations that S. polychroa does not reproduce by fission.Confirming this established fact, through comparison of mutational profiles, provides validation for our mutational analysis tools.A general source of C > T mutations in living organisms is methyl-CpG deamination, which is due to the higher propensity of methylated cytosines in the CpG context for spontaneous hydrolysis, resulting in thymine bases at the site ( 53 ).Though DNA methylation has been detected across Platyhelminthes ( 54 ), no measurable 5-methylcytosine levels have been found by the use of methylation sensitive restriction enzymes or antibodies against 5-methylcytosine in the planarian S. mediterranea ( 55 ).We applied a genome wide approach to test whether cytosines are methylated in S. polychroa .The presence of CpG methylation can be deduced by genomic triplet frequencies: CpG containing triplets, being inherently less stable, are underrepresented in highly methylated genomes, like those of vertebrates ( 56 ).To determine the contribution of CpG methylation in the elevated rate of C > T mutations after regeneration, we compared the genome-wide triplet spectra of S. polychroa , S. mediterranea and Homo sapiens (Figure 5 D).In accordance with the low GC ratios of planarian genomes, GC-rich triplets are generally depleted, however, CpG-containing triplets were not particularly underrepresented (Figure 5 E), as in vertebrate genomes.The profile of planarian CpG-containing triplets was more similar to the non-methylating Caenorhabditis elegans , implying that CpG methylation was not present in S. polychroa .

Regenerated animals are mosaics for the regeneration-induced mutations
To understand better the origin of de novo mutations after regeneration, we tested whether genomic changes in the filial generation were pre-existing in their ancestors.In the filial generation, altogether half of the clonal mutations had highconfidence support in the corresponding regenerated animal (Figure 6 A, Supplementary Table S10 ), with allele frequencies in the range of 1.09-5.56%.As this set also included events supported by a lower number of reads, we validated these events by PCR and high-coverage amplicon sequencing.This approach confirmed the presence of most of these variants in regenerated animals, even in cases where the coverage depth of WGS was insufficient to detect them (Figure 6 B, Supplementary Table S10 ).The allele frequencies detected by the two methods showed a moderate correlation of 0.62.
The maximum mutation allele frequency (by both PCR and WGS) was 6%, meaning 18% of the cells of regenerant carried those mutations in case of a triploid genome.This raises the question whether some of the mutations in the regenerants were already present at the time of the tail amputation.WGS supporting reads were detected for a small number of regenerant subclonal mutations also in the parental animals (Figure 6 D).This indicates that stem cells carrying these mutations were distributed throughout the body of the parental animal, including the amputated tail.The mutation-carrying clone expanded in the amputated tail and became subclonal in the regenerant.However, this refers only to a smaller number of mutations.Most of the regenerant subclonal mutations arose during regeneration.Altogether, the results of the regenerationassociated mutational analysis suggested that only a limited number of stem cells was required to regenerate animals, thus we expected to see a low number of dividing stem cells in an amputated piece of tail.On the contrary, a large number of ∼250 dividing stem cells was detected immediately after amputation in tail fragments, as determined by immunofluorescence of phosphorylated H3 histone (Figure 6 E, F).H3P staining underestimates the actual number of stem cells in the fragment, as it labels only those cells that are dividing in a given moment.Taken together, these results suggested that from a large pool of stem cells detected in the amputated tail, only some will undergo large proliferative expansion and give rise to the descendent cells found in the regenerated animal.

Discussion
We present here the first draft genome assembly of S. polychroa .Using a strategy that integrates long-read PacBio and Illumina sequencing methods, we provided a relatively contiguous genome assembly on 8253 scaffolds.Using multiple theoretical methods and an experimental one based on flow cytometry, we estimated that the genome size of S. polychroa is around 588 Mb.Analysis of repetitive elements of S. polychora suggested that they were divergent from those in the human genome, but similar to that of other planarian species ( 21 ,36 ): we found a higher percentage of DNA transposons and unclassified DNA elements.DNA transposons tend to integrate in the proximity of coding genes, and can also be efficiently excised from the sites of integration, having an effect on the organization and expression of nearby loci ( 57 ).Increased proportions of transposable elements might give rise to new allelic forms, which could be an important mechanism of diversification in a species that reproduces primarily by parthenogenesis.It has been shown recently that transposable elements can increase the levels of heterozygosity in inbred populations ( 58 ).
We also provide a transcriptome assembly of S. polychroa.While 71% of transcripts mapped uniquely on the genome assembly, a fraction of transcripts mapped twice or multiple times on the genome, mostly due to repetitive elements, and some transcripts appeared to be chimaeras.Chimeric mappings can be signs of assembly fragmentation, but it is also notable that a relatively large fraction of chimeric transcripts marked spliced leader trans-splicing.This particular form of trans-splicing has been observed in a range of eukaryotes, including Platyhelminthes species ( 50 ,59 ), and S. polychroa could potentially be a useful system to study this process.
Despite the fragmented nature of the bm_Spol_g1 genome assembly, it will enrich existing genomic information in the genus of planaria, and potentially provide opportunities for comparative genomic studies with related species.
We estimated the de novo mutation rate of S. polychroa and described its associated mutational profile.Described germline de novo mutations do not necessarily have a uniform origin.Presumably, some mutations can arise in somatic stem cells and persist at the subclonal level prior to the germline specification, while others form in differentiated germline, during gametogenesis.Both type of these mutations eventually become a parental germline mutation and a clonal mutation in the offspring.For this reason, the reported de novo mutations represent the mutational load of one generation, and allow for the estimation of the de novo mutation rate.The profile of de novo mutations resembled the 'clock-like' SBS5 (single base substitution 5) signature in the COSMIC (Catalogue of Somatic Mutations in Cancer) database ( 60 ).This signature correlates with aging and is considered a background mutagenic component both in human cells ( 61 ) and various other Eukaryotes ( 62 ).We also described linage-specific germline mutations, common to all animals in this study.It would be interesting to understand the origin of these mutations and whether they are related to parthenogenesis as a mode of reproduction.
Most surprisingly, when animals passed through a single round of externally induced regeneration, the number of mutations increased three-fold and we detected an alteration of the mutational profile.This is the first estimation of genomic alterations related to the process of regeneration in any living organism.The caveat of this observation is that it was made on a limited number of sequenced samples and future investigations on a larger pool of samples and in different systems should provide support for the reported observation.The genome-wide analysis of mutation allele frequencies helped understanding the regenerative development of S. polychroa , with three notable findings.First, all of the clonal de novo mutations in filia were also detected in the regenerants, often with high (in cases, over 10%) contribution to the regenerant body.Most of these mutations arose during regeneration as they were not detected in the parent, and their spectrum differed from somatic mutations in control animals.Single stem cells therefore appear to have contributed to large fractions of the DNA content and cell number of the regenerated and fully grown animals.The ∼1% median regenerant allele frequency of clonal filial mutations implies a ∼3% typical contribution by each mutation-acquiring stem cell to the full body.Though the number of dividing stem cells in the amputated tail fragment was high, as indicated by the mitosis marker H3P, only a limited number of stem cells contributed to rebuilding the rest of the animal and its gonads.This suggests that intraindividual selective pressures act on single stem cells, an evolutionary concept that has been described previously ( 63 ).Certain clonal mutations in filia were present at similar, relatively high proportions in the cells of the regenerants and very few of them also in the parental animal.Small number of such mutations, detected in the parental animal, suggested that the mutation-carrying somatic lineages made similar contributions to the cut tail and the remaining body of the parent, implying that planaria develop in a mixed mosaic manner, with cell lineages created by early divisions and marked by early somatic mutations contributing in similar proportions to different parts of the body (Figure 6 G).It is also possible that the similar contributions of mutation-carrying stem cells to the body and the tail resulted from the proportional cutting of spatially separated somatic clones (Figure 6 H), but we consider this less likely due to the similar contributions of several mutations to P R and R animals.Though technically challenging, whole genome sequencing from several separate body segments, could decisively differentiate these possibilities.Somatic mosaicism has been described previously in planarians, analysing two genetic loci in animals with different reproductive strategies in the species Dugesia subtentaculata ( 64 ).
Finally, the observed mutation spectra were also revealing.The altered mutation spectrum upon regeneration confirmed that regeneration comes at a cost at the genome level.Importantly, this change in the spectrum was unique for regeneration, as control de novo mutations were more similar to the spectrum of the lineage-specific germline mutations, shared by all animals.We can only speculate as to the cause of the abundance of C > T and T > C mutations in the regenerated filia .The same mutation classes dominate the SBS spectra of mismatch repair deficient cancer cells alongside oxidative damage-induced C > A mutations, and the C > T and T > C transition mutations are thought to primarily result from polymerase errors ( 65 ).It is thus possible that fast cell proliferation during planarian regeneration is accompanied by a reduced efficiency of DNA mismatch repair, resulting in the altered mutation rate and spectrum.
Our results highlight certain parallels between planarian regeneration and mammalian biology.The similar contribution of cell lineages to the body and the tail of planarian can result from early embryonic cell mixing also evidenced by patterns of chimeric mammals, or by extensive cell migration also typical of mammalian development ( 66 ).Tissue regeneration is best seen in the liver of mammals, and the regrowth of relapsed tumours following surgery can also be considered an example of a regenerative process.Genome alterations accompanying these processes are relevant to subsequent tumorigenesis or the development of resistance, and the planarian system can be a useful model for understanding the costs of tissue repair at the genome level.

Figure 1 .
Figure 1.De no v o genome assembly strategy.( A ) Images of S. polychroa ; cocoon (top right), young (top left) and adult animal (middle), close up on the head region of the adult indicating multiple photoreceptors (bottom).( B ) Flo w chart of the de no v o genome assembly strategy.

Figure 2 .
Figure 2. Analysis of genomic features in the S. polychroa assembly.( A ) Estimation of genome size by multiple methods.Each bar represents a genome size estimation according to the indicated approach.( B ) Representative FACS profiles of S. polychroa and mediterranea cells, stained for their DNA content.( C ) Histogram of allele frequencies of germline mutations.The two peaks of the distribution at 33% and 66% refer to the major and minor alleles at a heterozygous site and indicate that the S. polychroa strain used in theis study was triploid.( D ) Distribution of repeat element classes in the genomes of H. sapiens, S. polychroa, S. mediterranea and D. japonica .( E ) Histogram of relative positions of repeat elements.Only DNA elements longer than 2 kb were considered, and scaffolds lengths were normalized to 1.The individual points refer to the number of repeat instances having their start (red) or end (blue) in the respective percentile of the normalized scaffold length.

Figure 3 .
Figure3.Genome characterization.( A ) Analysis of transcriptome back-mapping.Automatic annotation by GMAP was manually revised to enable finer categorization.The majority of abnormal transcript mappings are due to overlaps with repeat elements.( B ) A representative example of sequence-level alignment between the assemblies of S. mediterranea and S. polychroa .The gene str uct ures of the analyzed regions in the two species are indicated at the axes.The black dots refer to the found direct alignments as assessed by lastz, and the colored horizontal and vertical lines are showing the coding e x on positions.( C ) Scheme of synteny identification.S. polychroa and S. mediterranea trancripts (orange and blue ribbons, respectively) were mapped to the respective genome assemblies by GMAP (thin black arrows), and interspecies homologies were taken from PlanMine (thick purple arrows).( D ) A representativ e e xample of synten y regions betw een the tw o Schmidtea species.B ack-mapped transcripts on S. poly chroa scaff olds scaff old_211 and scaffold_262 (orange ribbons) uncover a reciprocal translocation between to S. mediterranea scaffolds dd_Smes_g4_166 and dd_Smes_g4_33 .Some mapped genes on scaffold_262 are also linked to a third S. mediterranea scaffold, dd_Smes_g4_152 : here the exact breakpoint cannot be resolved, because rele v ant S. poly chroa transcripts in the region are not mapped on the bm_Spol_g1 assembly (purple ribbons).

Figure 4 .
Figure 4. Experimental set up and phylogenetic relations between samples.( A ) Scheme of experimental design for the analysis of de novo mutations.For the control group (top half), parental animals (P C ) were left to reproduce by parthenogenesis ( n = 4) and progenies ( filia -F C ) were collected and sequenced.In the case of the regenerated group (bottom half), tails of parental animals, or P R ( n = 4) were amputated and left to regenerate and reach sexual maturity (R).Regenerants reproduced by parthenogenesis and the progenies (F R ) were collected for the analysis.The panel was created with bioRender.com.( B ) Schematic figure depicting the positions of the reproductive organs in the planarian body, and the approximate cut site.( C ) Phylogenetic tree showing the planaria lineages used in this study.The colored symbols show the initial information regarding the identities of each sample.

Figure 5 .
Figure 5. Analysis of de novo mutations.( A ) Counts of de novo mutations in progeny of regenerated animals versus progeny of control animals.Mutations shared by siblings were marked.Significance was assessed by unpaired two-sample t-test.( B ) Mutational spectra of de novo mutations.( C ) Comparisons of Pearson correlations between the spectra of regenerated and control de novo substitutions and germline mutations shared by all animals.( D ) B ar plot sho wing the frequencies of trinucleotides in the genome of S. poly c hroa, S. mediterranea and H. sapiens.( E ) Scat ter plots indicating the relationship between triplet frequencies and counts of G / C bases in each trinucleotide.Triplets containing CpG motifs are shown in red.

Figure 6 .
Figure 6.Analysis of pre-existing subclonal mutations in regenerants.( A ) Ratios of mutations found in filia that are also identified as subclonal in the corresponding regenerant.( B ) Allele frequencies of filia clonal mutations detected by amplicon sequencing in the corresponding regenerant animals.Mutations are colored by their supporting evidence from WGS: only present as heterozygous in filia (grey), present in filia and subclonally in regenerants (red).( C ) Scatter plot of allele frequencies of subclonal mutations detected by WGS versus by amplicon sequencing of PCR products.Mutations are colored as in (B).( D ) Allele frequencies of subclonal mutations in regenerants and parental animals, demonstrating that few mutations were already present in parental animals.( E ) Immunofluorescence for H3P marker of mitosis, on amputated tails equivalent in size to the fragments from which regenerant animals formed.( F ) Quantification of the fragment size and number of H3P positive cells in each fragment.(G, H) Possible ways of organization of stem cell clones in the planarian body, symbolized by colored markers.( G ) Different clones are spread uniformly across the body.( H ) Clones are spatially separated.Pie charts show the distribution of colored clones in the cut tail and the rest of the body.
. Purified DNA was resuspended in 1 × TE and DNA quality was assessed by NanoVue, Qubit and 0.7% agarose gel electrophoresis.Only samples with NanoVue readings 260 / 280 ∼ 1.8; 260 / 230 ∼ 2, minimal concentration (estimated by Qubit) of 25 ng / μl and no indication of DNA degradation on an agarose gel were sequenced.Library preparation and sequencing of Illumina datasets (2 × 150 bp, paired end) were performed by Novogene (Beijing, China) using Illumina NextSeq 2000 instruments, obtaining approximately 50 × genome coverage.Long read library preparation and sequencing was performed on PacBio Sequel machines by Novogene, resulting in 18.04 Gbp read length on average, with 20x genome coverage.