A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning

ABSTRACT Background Diploid genome assembly is typically impeded by heterozygosity because it introduces errors when haplotypes are collapsed into a consensus sequence. Trio binning offers an innovative solution that exploits heterozygosity for assembly. Short, parental reads are used to assign parental origin to long reads from their F1 offspring before assembly, enabling complete haplotype resolution. Trio binning could therefore provide an effective strategy for assembling highly heterozygous genomes, which are traditionally problematic, such as insect genomes. This includes the wood tiger moth (Arctia plantaginis), which is an evolutionary study system for warning colour polymorphism. Findings We produced a high-quality, haplotype-resolved assembly for Arctia plantaginis through trio binning. We sequenced a same-species family (F1 heterozygosity ∼1.9%) and used parental Illumina reads to bin 99.98% of offspring Pacific Biosciences reads by parental origin, before assembling each haplotype separately and scaffolding with 10X linked reads. Both assemblies are contiguous (mean scaffold N50: 8.2 Mb) and complete (mean BUSCO completeness: 97.3%), with annotations and 31 chromosomes identified through karyotyping. We used the assembly to analyse genome-wide population structure and relationships between 40 wild resequenced individuals from 5 populations across Europe, revealing the Georgian population as the most genetically differentiated with the lowest genetic diversity. Conclusions We present the first invertebrate genome to be assembled via trio binning. This assembly is one of the highest quality genomes available for Lepidoptera, supporting trio binning as a potent strategy for assembling heterozygous genomes. Using our assembly, we provide genomic insights into the geographic population structure of A. plantaginis.



Background
The ongoing explosion in de novo reference genome assembly for non-model organisms has been facilitated by the combination of advancing technologies and falling costs of next-generation sequencing [1]. Long-read sequencing technologies further revolutionised the quality of assembly achievable, with incorporation of long reads that can span common repetitive regions leading to radical improvements in contiguity [2]. However, heterozygosity still presents a major challenge to de novo assembly of diploid genomes. Most current technologies attempt to collapse parental haplotypes into a composite, haploid sequence, introducing erroneous duplications through mis-assembly of heterozygous sites as separate genomic regions. This problem is exacerbated in highly heterozygous genomes, resulting in fragmented and inflated assemblies that impede downstream analyses [3,4]. Furthermore, a consensus sequence does not represent either true parental haplotype, leading to loss of haplotype-specific information such as allelic and structural variants [5]. Whilst reducing heterozygosity by inbreeding has been a frequent approach, rearing inbred lines is unfeasible or highly time-consuming for many non-model systems, and the resulting genomes may no longer be representative of wild populations.
Trio binning is an innovative new approach that takes advantage of heterozygosity instead of trying to remove it [6]. In this method, a family trio is sequenced with short reads for both parents and long reads for an F1 offspring. Parent-specific k-mer markers are then identified from the parental reads and used to assign offspring reads into maternal and paternal bins, before assembling each parental haploid genome separately [6]. The ability of trio binning to accurately distinguish parental haplotypes increases at greater heterozygosity, with high-quality, de novo assemblies achieved for bovid genomes by crossing different breeds [6] and species [7] to maximise heterozygosity. Therefore, trio binning has the potential to overcome current difficulties faced by highly heterozygous genomes, which have typically evaded high-quality assembly through conventional methods.
We utilised trio binning to assemble a high-quality, haplotype-resolved reference genome for the wood tiger moth (Arctia plantaginis; formerly Parasemia plantaginis [8]). This represents the first trio binned assembly available for Insecta and indeed any invertebrate animal species, diversifying the organisms for which trio binning has been applied outside of bovids [6,7], zebra finches [9], humans [6,9,10] and Arabidopsis thaliana [6]. Using a family trio with same-species A. plantaginis parents, 99.98% of offspring reads were successfully binned into parental haplotypes. This was possible due to the high heterozygosity of the A. plantaginis genome; heterozygosity of the F1 offspring was estimated to be ~1.9%, exceeding levels (~1.2%) obtained when crossing different bovid species [7]. Both resulting haploid assemblies are highly contiguous and complete, strongly supporting trio binning as an effective strategy for de novo assembly of heterozygous genomes.
The presented A. plantaginis assembly will also provide an important contribution to the growing collection of lepidopteran reference genomes [11]. Comparative phylogenomic studies will benefit from the addition of A. plantaginis to the phylogenomic dataset [12,13], being the first species to be sequenced within the Erebidae family [8,14], and the first fully haplotype-resolved genome available for Lepidoptera. A. plantaginis itself is an important evolutionary study system, being a moth species which uses aposematic hindwing colouration to warn avian predators of its unpalatability [15]. Whilst female hindwing colouration varies continuously from orange to red, male hindwings exhibit a discrete colour polymorphism maintained within populations (Figure 1), varying in frequency from yellow-white in Europe and Siberia, yellow-red in the Caucasus, and black-white in North America and Northern Asia [16,17]. Hence, A. plantaginis provides a natural system to study the evolutionary forces that promote phenotypic diversification on local and global scales, for which availability of a high-quality, haplotype-resolved and annotated reference genome will now transform genetic research.

Cross preparation and sequencing
To obtain an A. plantaginis family trio, selection lines for yellow and white male morphs For short-read sequencing of the father (sample ID: CAM015099; ENA accession number:

Trio binning genome assembly
Canu version 1.8 [18] was used to bin A. plantaginis F1 offspring PacBio (Pacific Biosciences) subreads into those matching the paternal and maternal haplotypes defined by k-mers specific to the maternal and paternal Illumina data (Supplementary Figure 1). This resulted in 1,662,000 subreads assigned to the paternal haplotype, 1,529,779 subreads assigned to the maternal haplotype, and 2,445 (0.07%) subreads unassigned. Using only the assigned reads, the haplotype-binned reads were assembled separately using wtdbg2 version 2.3 [19], with the '-xsq' pre-set option for PacBio Sequel data and an estimated genome size of 550 Mb. The assemblies were polished using Arrow version 2.3.3 [20] and the haplotype-binned PacBio reads. The 10X linked-reads were then used to scaffold each assembly using scaff10x [21], followed by another round of Arrow polishing on the scaffolds.  [23], then applied homozygous non-reference edits to the assembly using bcftools consensus [24]. The assembly was then split back into paternal and maternal components, giving separate paternal haplotype (iArcPla.TrioW) and maternal haplotype (iArcPla.TrioY) assemblies.
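The binning idea described above can be illustrated with a small Python sketch. This is a toy under stated assumptions, not Canu's implementation: the function names and example sequences are hypothetical, reverse complements and sequencing errors are ignored, and a read is assigned by a simple majority of parent-specific k-mers rather than by any normalised score.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """K-mers seen in one parent's short reads but never in the other's."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a long read to the parent whose marker k-mers it carries more of."""
    ks = kmers(read, k)
    m = len(ks & mat_only)
    p = len(ks & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no markers, or a tie


# Hypothetical toy data: the parents differ at a single site (A vs T).
K = 4
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], K)

for read in ["CGTACGT", "GTTCGTA", "CGTAA"]:
    print(read, bin_read(read, mat_only, pat_only, K))
```

In this toy, a read that overlaps the variable site carries marker k-mers and can be binned, whereas a read from a region identical in both parents stays unassigned. This is why higher heterozygosity leaves fewer reads unassigned.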
The assemblies were checked for contamination and further manually assessed and corrected using gEVAL [25]. The Kmer Analysis Toolkit (KAT) version 2.4.2 [26] was used to compare k-mers from the 10X Illumina data to k-mers in each of the haplotype-resolved assemblies, and in the combined diploid assembly representing both haplotypes. Phasing of the assembled contigs and scaffolds was visualised using the parental k-mer databases produced by Canu [27]. Haploid genome size, heterozygosity and repeat fraction of the F1 offspring were estimated using GenomeScope [28] and k-mers derived from the 10X Illumina data.
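To make the k-mer comparison concrete, the following Python sketch tabulates how many distinct read k-mers occur zero, one or two times in an assembly, split by their multiplicity in the reads. It only illustrates the kind of copy-number spectrum that such a comparison yields; it is not KAT's code, and the function names and toy sequences are assumptions made for the example.

```python
from collections import Counter, defaultdict


def count_kmers(seqs, k):
    """Count every k-mer occurrence across a collection of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def copy_number_spectrum(reads, assembly, k):
    """Map assembly copy number -> {read multiplicity: number of distinct k-mers}."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    spectrum = defaultdict(Counter)
    for kmer, multiplicity in read_counts.items():
        # A Counter returns 0 for k-mers that are absent from the assembly.
        spectrum[asm_counts[kmer]][multiplicity] += 1
    return spectrum


# Hypothetical toy data: one k-mer (TAC) is in the reads but not the assembly.
spectrum = copy_number_spectrum(["ACGTA", "ACGTA", "CGTAC"], ["ACGTA"], 3)
for copies in sorted(spectrum):
    print(copies, dict(spectrum[copies]))
```

In a well-resolved haploid assembly, most read k-mers would fall in the 1-copy row. K-mers in the 0-copy row with low read multiplicity typically reflect sequencing errors.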

Quality assessment
To assess the quality of each parental haplotype of the A. plantaginis trio binned assembly, standard contiguity metrics were computed, and assembly completeness was evaluated by  [37] was downloaded from NCBI RefSeq version 94 [38]. Cumulative scaffold plots were visualised in R version 3.5.1 [39] using the ggplot2 package version 3.1.1 [40].
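For readers unfamiliar with the contiguity metrics mentioned here, a minimal Python sketch of scaffold N50 and of the cumulative lengths behind a cumulative scaffold plot is given below. It follows the common definition of N50 (the length of the shortest scaffold in the smallest set of longest scaffolds covering at least half of the assembly); the function names and toy lengths are hypothetical and this is not the software used in the study.

```python
from itertools import accumulate


def n50(lengths):
    """Length L such that scaffolds of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def cumulative_lengths(lengths):
    """Cumulative assembly size with scaffolds ordered from longest to shortest."""
    return list(accumulate(sorted(lengths, reverse=True)))


# Hypothetical scaffold lengths (arbitrary units).
toy = [2, 8, 1, 5, 4]
print(n50(toy))
print(cumulative_lengths(toy))
```

Plotting the cumulative lengths against scaffold rank gives a curve that rises steeply for contiguous assemblies and slowly for fragmented ones.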

Genome annotation
Genome annotations were produced for each parental haplotype of the A. plantaginis trio binned assembly using the BRAKER2 version 2.1.3 pipeline [41]. A de novo library of repetitive sequences was identified in both genomes using RepeatScout version 1.0.5 [42].
Repetitive regions of the genomes were soft masked using RepeatMasker version 4.0.9 [43], Tandem Repeats Finder version 4.00 [44] and the RMBlast version 2.6.0 sequence search engine [45] combined with the Dfam_Consensus-20170127 database [46]. Raw RNA-seq reads were obtained from Galarza et al. 2017 [47] under study accession number PRJEB14172, and arthropod proteins were obtained from OrthoDB [48]. RNA-seq reads were trimmed for adapter contamination using cutadapt version 1.8.1 [49] and quality controlled before and after trimming with fastqc version 0.11.8 [50]. RNA-seq reads were mapped to each respective genome using STAR (Spliced Transcripts Alignment to a Reference) version 2.7.1 [51]. Arthropod proteins were aligned to the genomes using GenomeThreader version 1.7.0 [52]. BRAKER2's ab initio gene predictions were carried out using homologous protein and de novo RNA-seq evidence using Augustus version 3.3.2 [41] and GeneMark-ET version 4.38 [41]. Annotation completeness was assessed using BUSCO version 3.0.2 against the 'insecta_odb9' database of 1658 Insecta BUSCO genes with default Augustus parameters [29].

Cytogenetic analysis
Spread chromosome preparations for cytogenetic analysis were produced from wing imaginal discs and gonads of third to fifth instar larvae, according to Šíchová et al. 2013 [53]. Female and male gDNA were extracted using the CTAB (hexadecyltrimethylammonium bromide) method, adapted from Winnepenninckx et al. 1993 [54]. These were used to generate probe and competitor DNA, respectively, for genomic in situ hybridization (GISH

Population genomic analysis
We used the new A. plantaginis reference assembly to analyse patterns of population genomic variation between 40 wild, adult males sampled from the European portion of the Holarctic species range of A. plantaginis [17]. Samples were collected by netting and pheromone traps from Central Finnish (n=10) and Southern Finnish populations (n=10) where yellow and white morphs exist in equal proportions, an Estonian population (n=5) where white morphs are frequent compared to rare yellow morphs, a Scottish population (n=10) where only yellow morphs exist, and a Georgian population (n=5) where red morphs exist alongside yellow morphs (Figure 5A). Exact sampling localities are available in Supplementary

Results and Discussion
Trio binning genome assembly
K-mer spectra plots (Figure 2) indicate a highly complete assembly of both parental haplotypes in the A. plantaginis diploid offspring genome. There is good separation between the parental haplotypes, as each haploid assembly consists mostly of single-copy k-mers with low frequency of 2-copy k-mers, indicating a correctly haplotype-resolved assembly with low levels of artefactual duplication (Figure 2B, 2C; Supplementary Figure 2). This is also confirmed by the spectra plot for the combined diploid assembly (Figure 2A previously achieved through an inter-species cross between yak (Bos grunniens) and cattle (Bos taurus), which gave an F1 heterozygosity of ~1.2% [7].
The trio binned A. plantaginis assemblies are of comparable quality to the best reference genomes available for Lepidoptera (Table 2; Figure 3B). When compared to other published lepidopteran reference genomes, quality of the A. plantaginis assemblies surpasses all but the best Heliconius melpomene [32] and Bombyx mori [35] assemblies (Table 2; Figure 3B). As contiguity of the H. melpomene assembly was improved through pedigree linkage mapping and haplotypic sequence merging [32], whilst bacterial artificial chromosome (BAC) and fosmid clones were used to close gaps in the B. mori assembly [35], it is impressive that trio binning has instantly propelled contiguity of the A. plantaginis genome to very near that of H. melpomene and B. mori, before incorporating information from any additional technologies.
Therefore, these comparisons strongly support trio binning as an effective strategy for de novo assembly of highly heterozygous genomes. Future chromosome-level scaffolding with Hi-C technology [67] should elevate the quality of the A. plantaginis assembly to the top tier.

Cytogenetic analysis
Mitotic nuclei prepared from wing imaginal discs of A. plantaginis larvae contained 2n=62 chromosomes in both sexes (Figure 4), in agreement with a previously reported modal chromosome number of arctiid moths [68], which is also the likely ancestral lepidopteran karyotype [34]. These insights will be helpful for future scaffolding work into a chromosomal-scale A. plantaginis reference assembly. Chromosomes decreased gradually in size, as is typical for lepidopteran karyotypes [69]. Due to the holokinetic nature of lepidopteran chromosomes, separation of sister chromatids by parallel disjunction was observed in mitotic metaphases [70]. Notably, two smallest chromosomes separated earlier

Population genomic variation across the European range
As an empirical application of the A. plantaginis reference genome, we conducted a population resequencing analysis to describe genomic variation between 40 wild A. plantaginis males from five populations spread across Europe (Figure 5A). PCA revealed clear population structuring, with individuals clustering geographically by country of origin (Figure 5B), in congruence with strongly supported phylogenomic groupings also by country of origin (Figure 6). Central and Southern Finnish individuals grouped into a single population, as expected from their geographic proximity (Figure 5B; Figure 6). The Finnish and Estonian populations clustered together away from the Scottish population along principal component (PC) 2 (Figure 5B) and on the phylogenetic tree (Figure 6), as would be predicted by effects of isolation by distance [71]. The Georgian population was highly genetically differentiated from all other sampled European populations, separating far along PC1 (Figure 5B) and possessing a much longer inter-population branch in the ML tree (Figure 6). Since the genomic composition of the Georgian population is distinct from the rest of the sampled distribution, this could support the hypothesis of incipient speciation in the Caucasus [17]. However, populations in the large geographic gap between Georgia and the other populations in this preliminary analysis must be sampled to determine whether genetic differentiation persists when compared to nearby Central European populations.
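As a generic illustration of how a PCA separates genetically distinct groups, the Python sketch below projects samples onto the first principal component of a small genotype matrix using power iteration. The software and settings used for the analysis reported here are not shown in this section, so everything in the sketch is an assumption: the function name, the 0/1/2 allele-count coding and the toy genotypes are hypothetical.

```python
import math


def first_pc_scores(genotypes, iterations=100):
    """Scores of each sample on the first principal component.

    genotypes: one list per sample of alternate-allele counts (0, 1 or 2).
    """
    n_samples = len(genotypes)
    n_sites = len(genotypes[0])
    # Centre each site on its mean allele count across samples.
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_sites)]
    centred = [[row[j] - means[j] for j in range(n_sites)] for row in genotypes]

    # Power iteration on X^T X; the starting vector is arbitrary but non-zero.
    v = [1.0 + 0.1 * j for j in range(n_sites)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centred]
        v = [sum(centred[i][j] * scores[i] for i in range(n_samples))
             for j in range(n_sites)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0:
            return [0.0] * n_samples  # no variation in the data
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centred]


# Hypothetical genotypes: three samples from each of two differentiated groups.
toy_genotypes = [
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],  # group A
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],  # group B
]
pc1 = first_pc_scores(toy_genotypes)
print([round(s, 2) for s in pc1])
```

The sign of a principal component is arbitrary, so only the separation of the two groups along the axis is meaningful, not which group receives positive scores.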
Internal branch lengths were strikingly shorter within the Georgian population, indicating much higher intra-population relatedness than in populations outside of Georgia (Figure 6). plantaginis, with founders of the Caucasus population restricted during severe glacial conditions. The species origin of A. plantaginis therefore remains unknown, and may be clarified by future inclusion of an Arctia outgroup to root the phylogenetic tree.

Conclusions
By converting heterozygosity into an asset rather than a hindrance, trio binning provides an effective solution for de novo assembly of heterozygous regions, with this high-quality A. plantaginis reference genome paving the way for the use of trio binning to successfully assemble other highly heterozygous genomes. As the first trio binned genome available for any invertebrate species, the A. plantaginis assembly adds support to trio binning as the best method for achieving fully haplotype-resolved, diploid genomes. The high-quality A. plantaginis reference assembly and annotation itself will contribute to Lepidoptera comparative phylogenomics by broadening taxonomic sampling into the Erebidae family, whilst facilitating genomic research on A. plantaginis itself.

Availability of supporting data
All raw sequencing data for Arctia plantaginis reported in this article are available under ENA study accession number PRJEB36595.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

green k-mers are represented thrice (3-copy). The first peak corresponds to k-mers missing from the assembly due to sequencing errors, the second peak corresponds to k-mers from heterozygous regions, and the third peak corresponds to k-mers from homozygous regions.

Funding
These plots show a complete and well-separated assembly of both haplotypes in the F1 offspring diploid genome.

We hereby submit our research article entitled "A haplotype-resolved, de novo genome assembly for the wood tiger moth (Arctia plantaginis) through trio binning" for publication as a Data Note in GigaScience.
Achieving a high-quality, de novo assembly for highly heterozygous, diploid genomes is a major challenge for traditional assemblers, which collapse parental haplotypes into an artificial consensus. Trio binning is an innovative new method in which a family trio is sequenced, then heterozygosity is exploited to partition and assemble each parental haplotype separately, enabling full haplotype resolution of the offspring genome. In our manuscript, we assembled a high-quality reference genome for the wood tiger moth (Arctia plantaginis) via trio binning, sequenced with Illumina technology for the parents, and Pacific Biosciences and 10X Chromium technologies for their offspring. Due to the high, innate heterozygosity of the A. plantaginis genome, our same-species family trio achieved heterozygosity levels exceeding those previously obtained with an inter-species bovid cross. This enabled trio binning to work with great success, and instantly propel assembly quality towards that of the best lepidopteran reference genomes available, as we demonstrate through comparative quality assessments in this manuscript. Therefore, our successful assembly paves the way for use of trio binning as a potent strategy for assembling other heterozygous genomes. To the best of our knowledge, our assembly also represents the first invertebrate genome to be assembled via trio binning, thus significantly broadening the diversity of organisms to which trio binning has been applied, adding support to trio binning as the best method for assembling fully haplotype-resolved, diploid genomes.
The A. plantaginis reference genome and annotation itself will contribute to the growing collection of lepidopteran genomes, as the first species sequenced within the Erebidae family. This reference genome will also elevate genetic research within the A. plantaginis evolutionary study system, and we include an empirical application of the reference genome in this manuscript, where we report genome-wide structure and relationships between 40 wild resequenced males collected from five geographic populations spread across the European species range.