De novo genome assembly of the tobacco hornworm moth (Manduca sexta)

Abstract The tobacco hornworm, Manduca sexta, is a lepidopteran insect that is used extensively as a model system for studying insect biology, development, neuroscience, and immunity. However, current studies rely on the highly fragmented reference genome Msex_1.0, which was created using now-outdated technologies and is hindered by a variety of deficiencies and inaccuracies. We present a new reference genome for M. sexta, JHU_Msex_v1.0, applying a combination of modern technologies in a de novo assembly to increase continuity, accuracy, and completeness. The assembly is 470 Mb and is ∼20× more continuous than the original assembly, with scaffold N50 > 14 Mb. We annotated the assembly by lifting over existing annotations and supplementing with additional supporting RNA-based data for a total of 25,256 genes. The new reference assembly is accessible in annotated form for public use. We demonstrate that improved continuity of the M. sexta genome improves resequencing studies and benefits future research on M. sexta as a model organism.


Introduction
As a large, easily grown insect, the tobacco hornworm (M. sexta, NCBI: txid7130) is a key model for studying biochemical mechanisms in insect biochemistry, physiology, neurobiology, development, and immunity (Kay et al. 1992;Truman and Reiss 1995;Wieczorek et al. 1999;Mechaber et al. 2002). In particular, M. sexta is a well characterized model system for studying programmed cell death and metamorphosis (Jones et al. 1995). For example, the term "autophagic cell death" was introduced while studying metamorphic degeneration of muscle tissue in silkmoths (Lockshin and Williams 1965). Since then, the autophagic pathway has been studied extensively during the metamorphosis of many other lepidopterans. The size of M. sexta makes it an optimal system for studying the biochemical pathways associated with programmed cell death, but the system is limited by the quality of the reference genome. M. sexta's innate immune pathways, many of which share mammalian homologs, have been characterized in detail and used as a model system for studying fungal virulence and drug efficacy (Kanost et al. 2004;Lyons et al. 2020). Finally, M. sexta is useful for understanding serine proteases, a repertoire of proteins involved in mediating defence responses such as hemolymph clotting, melanotic encapsulation, food digestion, antimicrobial peptide induction, and cytokine activation (Jiang et al. 2010;Cao et al. 2015). The serine protease gene family constitutes a large protein family in insects containing 50-300 genes (Cao et al. 2015). In addition to basic biology, more study of this insect has practical applications; M. sexta is a prevalent agricultural pest that feeds on solanaceous plants, e.g., tobacco, and has the unique ability to tolerate large amounts of solanaceous alkaloids, such as nicotine (de Boer and Hanson 1987). Recent advancements in genomic sequencing have led to chromosome level reference genomes of other lepidopterans such as the domestic silkworm (Bombyx mori) and the monarch butterfly (Danaus plexippus) (Gu et al. 2019;Kawamoto et al. 2019). However, despite the importance of M. sexta as a model organism, no improvements to its reference assembly have been made since its release in 2016.
The current M. sexta genome is relatively fragmented, due to the abundance of repetitive sequences contained in most eukaryotic genomes. This lessens the value of the reference due to incomplete gene sequences and unanchored or mispositioned contigs on chromosomes. The first effort to sequence and assemble the M. sexta genome resulted in the Msex_1.0 assembly (GCA_000262585.1), a 419.4 Mb genome composed of 20,869 scaffolds with the largest scaffold being just 3.2 Mb (N50 ¼ 664 Kb). The accompanying gene set comprised of 15,451 protein-coding genes, of which 2498 were manually curated (Kanost et al. 2016). This level of fragmentation in the assembly can lead to erroneous gene annotations and mis-mapping of resequencing data, therefore, slowing progress in Lepidopteran genomics (Denton et al. 2014). The continuity of the M. sexta reference assembly significantly lags behind other lepidopteran reference genomes such as the domestic silkworm (B. Mori), a 460 Mb genome with a scaffold N50 of 16.8 Mb (Kawamoto et al. 2019). Because M. sexta is a model species in lepidoptera, its genome assembly needs to be highly accurate and contiguous to allow for comparative genomics studies of insect diversity, biochemical studies on homologous mammalian pathways, and agricultural studies regarding the tomato and tobacco crops, a food source for M. sexta.
Utilizing advancements in sequencing technology, we generated a new chromosome level M. sexta assembly (JHU_Msex_v1.0) that contains 70 Mb more genomic sequence and is 20-fold more continuous than the previous one. This allowed refining of gene models, identification and classification of repetitive DNA and transposable elements, enhancement of alignment of resequencing data and improved comparative genomics analysis. Our assembly consists of 470 Mb with a scaffold N50 of 14.2 Mb. We lifted over the original genome annotation to our new assembly and supplemented it with new high quality novel gene models including 794 previously un-annotated genes, including 47 novel serine proteases, constructed using publicly available RNA sequencing data. We demonstrate the utility of our new genome assembly through comparative genomics and gene expression analysis throughout M. sexta development cementing our assembly as a core resource for the lepidopteran community.

Sample collection and sequencing
All DNA extraction and sample collection was done on a single male moth purchased as a pupa from Carolina Biological (Burlington, NC). On the day of emergence, the moth was sacrificed by placing it at 4 C for 10 minutes. The moth's body, legs, head (including antennae and proboscis), and wings were snapfrozen in liquid nitrogen, then stored at -80 C prior to DNA extraction. High-quality genomic DNA was extracted from the legs and wings of the adult male moth using Qiagen G-tips. Briefly, we pulverized 14 mg of wing and leg tissue to a fine powder with a RETSCH CryoMill. DNA was purified from this powder using the Qiagen Genomic-tip 20/G kit with a modified lysis buffer consisting of 20 mM EDTA, 100 mM NaCl, 1% Triton-X, 33 mM Guanidine Thiocyanate, and 10 mM Tris-HCl. The 50 C lysis step was extended overnight. The quantity and quality of DNA were measured with Qubit 3.0 (Thermo Fisher Scientific, Inc., Carlsbad, CA, USA). We performed a total of four DNA extractions to generate all the Nanopore and Illumina data. Each extraction generated $2 lg of high-quality DNA which was used for library preparation and high throughput sequencing with Oxford Nanopore and Illumina platforms (Table 1).
Oxford Nanopore sequencing libraries were prepared using the Ligation Sequencing 1 D Kit (Oxford Nanopore, Oxford, UK, SQK-LSK109) according to manufacturer's instructions and sequenced for 48 hours on three MinION R9.4.1 flow cells.
Nanopore reads were base-called with Albacore Sequencing Pipeline Software (version 2.1.10). Sequencing runs from leg and wing tissues were run on separate flow cells and the data pooled for genome assembly purposes. To avoid any microbial contamination in the assembly, microbial read sequences identified by Centrifuge (v1.0.3-beta) were removed from Oxford Nanopore data (Kim et al. 2016). This resulted in a total of 4,404,206 reads yielding 19.5 Gb with a read length N50 of 9156 bp (Table 1; Supplementary Table S1 and Figure S1). For shotgun Illumina sequencing, a paired-end (PE) library was prepared with the Nextera DNA Flex Library Prep Kit from Illumina and sequenced on the Illumina NovaSeq6000 (Illumina, Inc., San Diego, CA, USA) yielding 219 M paired-end 150 (PE150) reads. A total of 32.83 Gb of Illumina data were generated and used for genome survey, correction, and evaluation. All sequencing data have been deposited at the NCBI SRA database under BioProject PRJNA658700.

Hi-C
To assemble contigs into chromosome sized scaffolds, we generated Hi-C libraries using 200 mg of tissue from the head (including proboscis and antenna) of the same moth. The tissue was grounded with a mortar and pestle cooled by liquid nitrogen. We then followed the Phase Genomics Proximo HiC animal kit (Phase Genomics, Inc., Seattle, WA, USA) protocol. The quality of the resulting purified library was evaluated with Qubit 3.0 (Thermo Fisher Scientific, Inc.), and the Agilent 4200 TapeStation System (Agilent Technologies, Inc., Santa Clara, CA, USA). For a final QC step, we ran the library on a 2x100 cycle sequencing run on an Illumina MiSeq v2 (Illumina, Inc., San Diego, CA, USA). The qualified library was sequenced using the Illumina NovaSeq6000 (Illumina, Inc., San Diego, CA, USA) again generating 150 bp paired-end reads. A total of 218 M reads (32.76 Gb) were generated on the NovaSeq6000 and used for the subsequent Hi-C analysis (NCBI SRA BioProject PRJNA658700).

Genome assembly and polishing
Nanopore sequencing reads were used to construct an initial assembly using Canu (v2.0) (Koren et al. 2017). This initial assembly contained 5381 contigs with a contig N50 of 424,554bp. A contig is a continuous stretch of DNA sequence and a scaffold is the connection of multiple contigs filled in with gaps. Longer contigs and scaffolds indicate more continuous genome assemblies. Next, we polished the Canu assembly by first aligning all nanopore data to the draft assembly with Minimap2 (2.17-r943-dirty) and running a single iteration of the nanopolish (v0.11.1) consensus module (Loman et al. 2015;Li 2018). Nanopolish uses a hidden Markov model (HMM) to examine the raw electrical data for possible improvements to the consensus sequence. The polished nanopore draft assembly was then error-corrected with 32.38 Gb of shotgun Illumina data using Bowtie2 (v2.4.1) for alignment and Racon (v1.3.3) for polishing (Langmead and Salzberg 2012; Vaser  (Marc¸ais et al. 2018). To assess the improvements made by polishing, we ran BUSCO, a quantitative assessment of genome assembly and completeness based on evolutionarily informed expectations of gene content (Simão et al. 2015). Errors in the assembly cause genes to go undetected by BUSCO, therefore being labeled as "missing" or "fragmented" BUSCOs. A single iteration of nanopolish improved BUSCO completeness scores by 4.0% ( Figure 1B). A single iteration of Racon improved BUSCO completeness scores by 19.6% ( Figure 1B) and further iterations had minimal improvements (Supplementary Figure S3).

Repetitive elements
With the genome in hand, next we examined the repetitive elements in the M. sexta genome assembly. We used Repeat Modeler2 to generate de novo repeat libraries (Flynn et al. 2020). Next, we performed homology based repeat masking using the de novo libraries in combination with the curated Metazoan library with RepeatMasker. We detected 159 Mb of repetitive sequence representing 33.91% of the genome, which is greater than the 28.84% detected from the Msex_1.0 assembly ( Figure 1A, Supplementary   A comparison of assembly continuity metrics between the JHU_Msex_V1.0 assembly and the Msex_1.0 assembly. Contig and scaffold N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value, therefore larger N50 values indicate more continuous assemblies. We note a contig N50 that is 20-fold greater, and the largest scaffold 7 times larger in the JHU_Msex_V1.0 compared to Msex_1.0. 2019), but notably greater than reported for the monarch butterfly, D. plexippus ($13% of 273 Mb) (Zhan et al. 2011). Among classified repeats, LINE elements were the most abundant superfamily found in the M. sexta genome, with the L2 LINE element being the most common (Supplementary Table S4). However, the overwhelming majority of masked regions dispersed throughout the genome corresponded to complex repetitive sequences yet to be characterized (11.88%), which is consistent with the domestic silkworm (11.73%).

RNA-seq data
Sixty-seven RNA-seq data sets from M. sexta were downloaded from the sequence read archive (SRA) with accession numbers listed in Supplementary Data S1 (Cao and Jiang 2017). All RNAseq sequences were first trimmed with TRIMMOMATIC (SLIDINGWINDOW:4:20 LEADING:10 TRAILING:10 MINLEN:50) prior to alignment to the JHU_Msex_V1.0 assembly with HISAT2 using default parameters (Bolger et al. 2014;Kim et al. 2019). An average of 96.2% of reads aligned from all libraries (Supplementary Data S1). These bam files were used for both gene annotation and expression analyses.

Gene annotation
Gene annotation was primarily accomplished via a liftover of the Msex_1.0 annotations from the NCBI annotation (GCA_ 000262585.1) with Liftoff, a tool that maps annotations between closely related species or, in this case, two assemblies of the same species ( Additionally, we assembled the full transcriptome from the RNA-seq data with Stringtie2 (Kovaka et al. 2019). Both the de novo transcriptome assembly and ab initio gene predictions are prone to assembling erroneous transcripts. In order to only filter for high quality transcripts, we first aligned both the Stringtie2 and AUGUSTUS gene models to the Liftoff gene models with GFFCompare (Pertea and Pertea 2020). To identify unannotated genes, we examined transcripts supported by evidence from both the Stringtie2 assembly and the AUGUSTUS prediction, but not contained in the Liftoff annotation. We identified 794 such transcripts to add to the annotation. Overall, we identified 25,256 total genes compared to the 24,462 contained in the NCBI annotation of Msex_1.0.

Data availability
The JHU_Msex_v1.0 genome, corresponding annotation, Supplementary Data S1 and S2, and Supplementary material is available on Zenodo (https://doi.org/10.5281/zenodo.4302295). This Whole Genome Shotgun project has been deposited at DDBJ/ ENA/GenBank under the accession JACVES000000000. The version described in this paper is version JACVES010000000. Whole genome sequencing data are available in the NCBI SRA database under BioProject PRJNA658700. The scripts we used in this article, including the genome assembly, genome polishing, repeat annotation, and genome assessments, are available on Github (https://github.com/timplab/moth).

Expression analysis
Using the same publicly available RNA-seq data (Supplementary Data S1), we wanted to examine how our assembly improved results of gene expression analysis. The average alignment rate from each dataset to JHU_Msex_v1.0 was 96.16% and to the Msex_1.0 assembly was 88.23% (Figure 3; Supplementary Data S1). We noted that paired-end data sets aligned significantly better to our assembly than to the old assembly with an increase of 11.92% of reads aligned as compared to single-end libraries where we only saw an improvement 4.06% of reads aligning. Gene abundance values were calculated using our annotation with the Stringtie2 gene abundance pipeline and the regularized log transformation (rlog) from DEseq2 (Love et al. 2014). The rlog transformation takes the read count data from Stringtie2 and accounts for differences in sequencing depth, RNA composition, heteroskedasticity, and large dynamic range (Love et al. 2014). To filter out genes with low expression in all tissues, we only retained genes if they had an rlog score of greater than three in at least one library. We then calculated the z-score of the rlog values and computed a Euclidean distance matrix to perform hierarchical clustering, which resulted in 18 gene clusters (Figure 3). To annotate the gene clusters, we ran Interproscan5 to add GO annotations for all genes then used TopGO for GO term enrichment calculations within each gene cluster (Alexa et al. 2006;Jones et al. 2014). Significantly enriched GO terms for biological process (BP), molecular function (MF), and cellular compartment (CC) were calculated by the Fisher's exact test and determined significant if the P-value was less than 0.05 (Supplementary Data S2). We note that these RNA-seq datasets were collected from different insects without biological replicates and throughout multiple RNA-seq studies and while results show interesting patterns, we cannot make definitive biological conclusions.
As expected, the enriched GO terms were well correlated with the expression pattern of gene clusters. For example, cluster 10 has the highest expression in the ovaries and testes and the most significantly enriched terms of cluster 10 include microtubulebased process, lncRNA processing and transcription (Supplementary Data S2). Sperm generation and meiosis requires continuous cell division and chromosome segregation, aided by the movement of microtubules. Additionally, the flagellar motor of sperm cells that aids in their motility is made up of microtubules. Cluster 7 includes genes, which are highly expressed in the antennae and in the adult head. Consistent with these tissues being responsible for sensory perception, GO annotations are enriched for GTPase activity and odorant binding (Supplementary Data S2). Insects have a repertoire of sensory driven behaviors and odorant receptor (ORs) genes code for an entire family of G protein coupled receptors (GPCRs), a class of transmembrane proteins that have GTPase activity (Benton 2006).
We also noted that both clusters 2 and 16 were highly expressed in the midgut tissue, albeit, at different developmental stages. Cluster 2 contains genes involved in proteolysis, metabolic processes, and catalytic activity, and the expression is high in the larval stages of the midgut until larval stage five (L5), the pre-wandering stage. In the stages after L5, expression of cluster 16 genes increases in the midgut (Supplementary Data S2). Cluster 16 contains genes involved in the negative regulation of metabolic processes and apoptotic processes (Supplementary Data S2). During the process of metamorphosis the moth larvae will stop feeding in the pre-wandering stage, which coincides with the decrease in the expression of the cluster 2 genes and increase in the negative regulation of metabolic processes. During the pupation process, the insects experience massive tissue reorganization including death of major organs (Zakeri et al. 1996;Dai and Gilbert 1997). This process is likely mediated by an apoptotic or an autophagic genes, eg., autophagy-related protein 8 (ATG8), as seen with B. mori (Romanelli et al. 2016).

Metamorphosis in the midgut
To showcase our new reference genome, we focused on gene expression changes occurring in the midgut during M. sexta development. We focused on serine proteases, which are proteins involved in catalytic cleavage of other proteins, significant in physiological processes like digestion, development, and defense. We ran Interproscan5 to identify putative serine proteases by Pfam classification (Finn et al. 2014). We identified 240 proteins with potential serine protease domains (PF00089), compared to the 193 identified in Msex_1.0 (Kanost et al. 2016). Of the serine proteases we identified 76 that were highly expressed primarily in the midgut, compared to the 68 gut proteases previously identified in Msex_1.0 (Kanost et al. 2016) ( Figure 4A). Upon the cessation of feeding in the L5 prewandering stage, expression of the digestive proteases declines rapidly as the insects prepare for pupation ( Figure 4, A and B). With our new highly contiguous genome assembly, we were able to annotate large protease gene arrays on scaffolds 13 (60 kb) and 18 (1 Mb).
To further investigate the mechanism of midgut tissue reorganization during metamorphosis, we analyzed the gene expression patterns of genes involved in both apoptosis and autophagy in the M. sexta midgut throughout development. We generated lists of autophagy and apoptosis related genes using the GO annotations GO0006914 and GO0006915 from Interproscan5 as those encompassed multiple Pfam domains and gave a more complete list of genes related to these pathways. We did not note a distinct expression pattern of apoptotic genes throughout midgut development (Supplementary Figure S5). However, we did identify an increase in expression of genes within the autophagic pathway beginning at the L5 pre-wandering stage (G-L5-preW-S), which coincides with the decline in digestive protease expression at the L5 wandering stage (G-L5-W-S/G-L5-W) ( Figure 4B). These results indicate the potential involvement of the autophagic pathway in M. sexta midgut during metamorphosis, at the same time in development that autophagy has been shown to remodel the midgut of other insects like fruit flies (Denton et al. 2009

Discussion
Our results have leveraged short and long read sequencing technology to assemble a highly contiguous reference genome of M. sexta. We lifted over the original genome annotations to this improved assembly, maintaining functional annotations and gene IDs for researchers currently studying M. sexta genes. We believe the improved assembly will aid current and future studies using M. sexta as a model system for research on fundamental processes in insect physiology and biochemistry. We note that this assembly is a male (ZZ) system and does not contain sequence for the female W chromosome. Future work could assembly the female W chromosome, as well as develop a deeper understanding of diversity in the species population.
Acknowledgments W.T., T.G.R., W.A.S., and A.G., conceived, initiated, and managed the project. T.G.R. reared insects with guidance from W.A.S. A.G. and T.G.R. were responsible for DNA sequence data production. A.G., T.G.R., and Y.F. performed assembly and associated tasks. A.G. and R.R. performed quality control and/or contributed additional analyses. A.G. performed RNA analyses and annotation  Figure S6) and managed public presentation of the assembly files. A.G. and W.T. wrote and edited the manuscript. All authors read and edited the manuscript.

Funding
This work was partially supported by NIH Ruth L. Kirschstein Institutional National Research Service Award (T32) GM007445 and NEU REU-Site Award 1757443.
Conflicts of interest: W.T. has two patents (8,748,091 and 8,394,584) licensed to Oxford Nanopore Technologies.  Cao and Jiang (2017). The first part of the library names indicates that the libraries are made from midgut (G). The second part indicates major stages of the insect, i.e., embryo (E), 1st to 5th instar larvae (L1 À L5), pupae (P), and adults (A). In the third part, "D" stands for day, "h" for hour, "preW" for pre-wandering, "W" for wandering. "S" in the last part of library names indicates single-end sequencing; no "S" in the end indicates paired-end sequencing. The libraries present as follows: midgut (G) (2nd L; 3rd L; 4th L, 0 h; 4th L, 12 h; 4th L, late; 5th L, 1-3 h; 5th L, 24 h;. 5th L, preW; 5th L, W; P, D1; P, D15-18; A, D3-5). (B) (Left) Heatmap of expression Z-score of midgut autophagic genes throughout development. Gene names were assigned based on the NCBI GCA_000262585.1. Genes not present in this annotation were functionally annotated with Interproscan5 and assigned gene names. All genes in the heatmap were annotated as autophagy Gene Ontology Term (GO:0006914)