Draft Genome Assemblies and Annotations of Agrypnia vestita Walker, and Hesperophylax magnus Banks Reveal Substantial Repetitive Element Expansion in Tube Case-Making Caddisflies (Insecta: Trichoptera)

Abstract Trichoptera (caddisflies) play an essential role in freshwater ecosystems; for instance, larvae process organic material from the water and are food for a variety of predators. Knowledge on the genomic diversity of caddisflies can facilitate comparative and phylogenetic studies thereby allowing scientists to better understand the evolutionary history of caddisflies. Although Trichoptera are the most diverse aquatic insect order, they remain poorly represented in terms of genomic resources. To date, all long-read based genomes have been sequenced from individuals in the retreat-making suborder, Annulipalpia, leaving ∼275 Ma of evolution without high-quality genomic resources. Here, we report the first long-read based de novo genome assemblies of two tube case-making Trichoptera from the suborder Integripalpia, Agrypnia vestita Walker and Hesperophylax magnus Banks. We find that these tube case-making caddisflies have genome sizes that are at least 3-fold larger than those of currently sequenced annulipalpian genomes and that this pattern is at least partly driven by major expansion of repetitive elements. In H. magnus, long interspersed nuclear elements alone exceed the entire genome size of some annulipalpian counterparts suggesting that caddisflies have high potential as a model for understanding genome size evolution in diverse insect lineages.


Introduction
With 16,544 extant species (Morse 2020), caddisflies (Insecta: Trichoptera) are the most diverse of the primary aquatic insect orders, comprising more species than the other four (Odonata, Ephemeroptera, Plecoptera, and Megaloptera) combined (Dijkstra et al. 2014). This diverse group of insects has successfully colonized all types of freshwater (and even intertidal) habitats across all continents north of Antarctica. Within these freshwater ecosystems caddisflies play important roles, including nutrient cycling and energy flow, and stabilizing the waterbed. They also act as biological indicators of water quality (Morse et al. 2019). Trichoptera is divided into two suborders, Annulipalpia and Integripalpia, both of which produce silk in modified labial glands (Thomas, Frandsen, et al. 2020). Annulipalpians use silk to construct small homes and capture nets that are fixed to the substrate, whereas most integripalpians' use silk to connect material into portable tube cases offering protection, camouflage, and even aiding in respiration ( fig. 1) (Wiggins 2004). This innovation in extended phenotype has potentially facilitated their radiation across a multitude of different environments including streams, lakes, ponds, and even marine environments.
Relative to their diversity, most insect groups remain poorly represented in existing genomic resources-a trend which is particularly pronounced in aquatic insects (Hotaling et al. 2020). Yet insects commonly show dynamic genome evolution within groups, including major variation in genome size that is often linked to expansion and loss of repetitive DNA (Lower et al. 2017;Petersen et al. 2019;Pflug et al. 2020). Insect diversity offers a vast supply of potential model systems for understanding how genomes evolve, especially as advancing sequencing technology enables more cost-effective, highquality genome assemblies in any model system. Currently, there are three long-read based draft Trichoptera genome assemblies (Luo et al. 2018;Heckenhauer et al. 2019 ; table 1). However, all of these were generated for species within the suborder Annulipalpia, leaving approximately 10,453 integripalpian species (Morse 2020) and 275 Myr of evolutionary history poorly represented by genomic resources (Thomas, Frandsen, et al. 2020). In addition, a lack of genetic resources in the large case-making radiation within caddisflies prevents research into the genomic basis of the fascinating evolutionary history and ecological diversification of this diverse and important group of caddisflies. Here, we report the first long-read based de novo genome assemblies and annotations of two tube-case making intergripalpian caddisflies, Agrypnia vestita Walker and Hesperophylax magnus Banks. Their estimated genome sizes are more than 3-fold larger than previously sequenced annulipalpian caddisflies. We show that this is, at least partly, due to a large expansion of repeat content in the case-making caddisflies compared with retreat-making caddisflies.

Sequencing and Assembly
We collected individuals of both species in the wild, A. vestita as an adult and H. magnus as a pupa. Following extraction, we sequenced genomic DNA from A. vestita on an Illumina HiSeq 2500 lane and on 23 PacBio sequel SMRT cells. We sequenced genomic DNA from H. magnus using two Illumina NovaSeq and four Oxford Nanopore FLO-MIN 106 flow cells. Further details are provided in supplementary note 1, Supplementary Material online. For both data sets, we conducted a de novo hybrid assembly using MaSuRCA v.3.1.1 (Zimin et al. 2013(Zimin et al. , 2017. MaSuRCA aligns highfidelity short reads to more noisy long reads to generate "megareads," which are then assembled in CABOG (Miller et al. 2008), an overlap-layout-consensus assembler. In the config file for each run, we specified an insert size of 500 bp for the Illumina paired-end reads with a standard deviation of 50 and a Jellyfish hash size of 100,000,000,000. All other parameters were left as defaults. We screened genome assemblies for potential contaminants with BlobTools v1.0 (Laetsch and

Significance
There is a lack of genomic resources for aquatic insects. So far, in the insect order Trichoptera, only three high-quality genomes have been assembled, all from individuals in the retreat-making suborder Annulipalpia. In this article, we report the first high-quality genomes of two case-making species from the suborder Integripalpia, which are essential for studying genomic diversity across this ecologically diverse insect order. Our research reveals larger genome sizes in the tube case-makers (suborder Integripalpia, infraorder Phryganides), accompanied by a disproportionate increase of repetitive DNA. This suggests that genome size is at least partly driven by a major expansion of repetitive elements. Our work shows that caddisflies have high potential as a model for understanding how genomic diversity might be linked to functional diversification and forms the basis for detailed studies on genome size evolution in caddisflies.
Supplementary Material online) with the OrthoDB v.10 Insecta and Endopterygota gene sets (Kriventseva et al. 2019) and generated genome statistics using the assembly_stats script (Trizna 2020; supplementary table 1, Supplementary Material online, for full output). We conducted genome profiling (estimation of major genome characteristics such as size, heterozygosity, and repetitiveness) on the short-read sequence data with GenomeScope 2.0 (Ranallo-Benavidez et al. 2020) and findGSE (Sun et al. 2018)

Repeat and Gene Annotation
We conducted comparative analysis of repetitive elements for the genomes generated in this study, the three previously available long-read based annulipalpian genomes, and Limnephilus lunatus, the integripalpian with the highest quality short-read genome assembly. We identified and classified repetitive elements de novo and generated a library of consensus sequences for each genome using RepeatModeler 2.0 (Flynn et al. 2019). We then annotated and masked repeats in each assembly with RepeatMasker 4.1.0 (Smit et al. 2013(Smit et al. -2015 using the custom repeat library for the species generated in the previous step. Finally, we reran RepeatMasker on Comparison of assembly length and repetitive DNA content among genome assemblies for three annulipalipan (Hydropsyche tenuis, Plectrocnemia conspersa, and Stenopsyche tienmushanensis) and three integripalpian (Hesperophylax magnus, Limnephilus lunatus, and Agrypnia vestita) species. The total length of bars indicates assembly size and colored segments within bars indicate the fraction of the assembly belonging to major repeat categories identified by RepeatModeler2 and annotated by RepeatMasker. Artwork to the right of plots shows examples of fixed retreats built by annulipalpians compared with integripalpian tube cases. Each Illustration is derived from a member of the same genus as the genome assemblies. The phylogeny is based on Thomas, Frandsen, et al. (2020). the masked genome using the Repbase arthropod repeat library (Bao et al. 2015). We conducted an orthogonal analysis of repeat dynamics using a reference-free approach by normalizing subsampled Illumina data for each sample using RepeatProfiler (Negm et al. 2020) and then analyzing normalized data sets for repeat content in RepeatExplorer2 (Nov ak et al. 2013), with more details provided in supplementary note 5, Supplementary Material online.
To generate evidence for gene annotation, we aligned previously sequenced transcriptomes to each genome using BLAST-like Alignment Tool v3.6 (BLAT, Kent 2002). We aligned the transcriptome of the closely related Phryganea grandis from the 1KITE project (111126_I883_FCD0GUKACXX_L7_ INShauTBBRAAPEI-22 http://www.1kite.org/) to A. vestita, and we aligned the Hesperophylax transcriptome from (Wang et al. 2015) to H. magnus. We generated ab initio gene predictions using AUGUSTUS v3.3 (Stanke et al. 2008) with hints generated from RepeatMasker 4.1.0 and BLAT v3.6 and by supplying the retraining parameters obtained from the BUSCO analysis (supplementary note 6, Supplementary Material online). Following annotation, we removed genes from our annotation that did not generate significant BLAST hits or lacked transcript evidence. Lastly, functional annotations were identified using Blast2GO (Gö tz et al. 2008).

Results and Discussion
Assembly Here, we generated the first genome assemblies based on long-read sequencing from the species diverse caddisfly suborder, Integripalpia. They provide important genome resources and fill a gap in evolutionary history of more than 275 Myr (Thomas, Frandsen, et al. 2020). The A. vestita genome was sequenced using $88Â Illumina sequence coverage and $18Â PacBio read coverage. After contaminated contigs were removed, the resulting assembly contained 25,541 contigs, a contig N50 of 111,757 bp, GC content of 33.77%, and a total length of 1,352,945,503 bp. BUSCO analysis identified 94.4% (91.4% complete, 3.0% fragmented) of the Insecta gene set in the assembly (see supplementary note 3, Supplementary Material online, for further details). The H. magnus genome was sequenced with $49Â Illumina sequence coverage and $26Â Oxford Nanopore sequence coverage. The resulting assembly has 6,877 contigs, a contig N50 of 768,217 bp, GC content of 34.36%, and a total length of 1,275,967,528 bp. We identified 95.9% (95.2% complete, 0.7% fragmented) of the Insecta BUSCO gene set in the final assembly. Although the sequencing and assembly techniques were similar to those used in previous efforts to sequence and assemble high-quality reference genomes in Trichoptera (table 1, Luo et al. 2018;Heckenhauer et al. 2019), the contiguity of these genomes was lower. This is likely to have been caused by large genome size and the proliferation of repetitive DNA, which represents one of the primary barriers to genome assembly. However, despite these challenges, both genome assemblies represent a substantial improvement in contiguity to previous assemblies of integripalpian caddisflies generated from short-read data alone. For example, at the time of writing, the highest quality integripalpian genome assembly on GenBank is Limnephilus lunatus, which was assembled from short-read data and has a contig N50 of 24.2 kb, giving further evidence to the difficulty of assembling large, repetitive caddisfly genomes.

Annotation and Repeat Analysis
We also report the functional annotations of H. magnus and A. vestita. Of 59,600 proteins predicted by AUGUSTUS for A. vestita, 21,637 were verified by BLAST and/or transcript evidence (and maintained in the final annotation), 14,096 were mapped to GO terms, and 5,362 were functionally an- The results of genome assembly repeat annotation, genome profiling, and de novo repeat assembly with RepeatExplorer2 all showed a disproportionate increase of repetitive DNA in integripalpian genomes compared with annulipalpians for those species sampled ( fig. 1, supplementary note 5 and supplementary figs. 5-7, Supplementary Material online). In the integripalpian species, unclassified repeats alone make up an average of >500 million bases, which exceeds the estimated genome sizes of all three annulipalpians analyzed. After unclassified repeats, long interspersed nuclear elements (LINEs) are the most abundant repeat category showing a disproportionate increase. LINEs comprise an average of >200 million bases in integripalpians, and show a $4-fold average increase in genome proportion (avg. genome proportion ¼ 15.1%) compared with the annulipalpians (avg. genome proportion ¼ 3.8%). Hesperophylax has more bases annotated as LINEs ($249 million) than the size of the entire Hydropsyche genome assembly. Rolling circles and long terminal repeats (LTRs) also show disproportionate increase in integripalpians ($3.5-and 18-fold increases in genome proportion, respectively), however both categories make up a much smaller fraction of integripalipan genomes (<2.3% on an average). DNA transposons are abundant in all integripalpian genomes we studied (average of 92 million bases annotated), however their genomic proportion decreased relative to annulipalipans in which DNA transposons were the most abundant classifiable repeat category (11.3% avg. genome proportion in annulipalipans vs. 6.9% in integripalpians).
The high abundance of unclassified repeats we observed in the integripalpian genomes is not surprising given that Trichoptera repeats are poorly represented in repeat databases. Unclassified repeats may also represent the remnants of ancient transposable element expansions, which are particularly difficult to annotate (Hoen et al. 2015). This explanation of old repeat expansions accounting for much of the unclassified repeats is consistent with results of clustering analysis in RepeatExplorer2 which shows many unannotated superclusters that make up small fractions of the genome (supplementary fig. 3, Supplementary Material online). We do not observe large unannotated superclusters that would indicate failed annotation of abundant, recently active repeats. Given the apparent suborder-specific increase in unclassified repeats, we hypothesize that ancient transposable element activity in the ancestor of integripalpians contributed to the larger genome sizes we observe; however, denser sampling of genomes across Trichoptera suborders is needed to address this hypothesis.
Given the major variation in genome size and repeat abundance, our findings suggest Trichoptera has high potential as a model for gaining insights into genome evolution in diverse insect lineages. Future investigation on the role of LINEs in genome diversification is of particular interest given our findings. We present preliminary evidence that LINEs show suborder-specific expansions, albeit with very limited taxon sampling. LINEs (especially L1) play major roles in genome stability, cancer, and aging (Van Meter et al. 2014;De Cecco et al. 2019). In many groups LINEs are hypothesized to play important evolutionary roles, including roles in rapid genome evolution though their own movement (Kordi s et al. 2006;Warren et al. 2008;Suh et al. 2015), and by facilitating expansion of other repeat classes (Grandi and An 2013;Sproul et al. 2020). It is possible that these elements have been important drivers in the expansion of integripalpian genomes. The high-quality genome assemblies and repeat libraries we present here provide a starting point for investigating the role of repeats in genome evolution across caddisfly lineages. In addition, we close a large evolutionary gap in genomic resources within a large, ecologically diverse clade in which additional genome sequencing can enable new insights as to the genomic basis of adaptation and diversification within freshwater environments.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.
performed on the Smithsonian High-Performance Computing Cluster (SI/HPC: doi.org/10.25572/SIHPC). L.K.O. was supported by a BYU College of Life Sciences Undergraduate Research Award.