Chromosome-Level Genome Assembly and Annotation of a Periodical Cicada Species: Magicicada septendecula

Abstract We present a high-quality assembly and annotation of the periodical cicada species, Magicicada septendecula (Hemiptera: Auchenorrhyncha: Cicadidae). Periodical cicadas have a significant ecological impact, serving as a food source for many mammals, reptiles, and birds. Magicicada are well known for their massive emergences of 1 to 3 species that appear in different locations in the eastern United States nearly every year. These year classes (“broods”) emerge dependably every 13 or 17 yr in a given location. Recently, it has become clear that 4-yr early or late emergences of a sizeable portion of a population are an important part of the history of brood formation; however, the biological mechanisms by which they track the passage of time remain a mystery. Using PacBio HiFi reads in conjunction with Hi-C proximity ligation data, we have assembled and annotated the first whole genome for a periodical cicada, an important resource for future phylogenetic and comparative genomic analysis. This also represents the first quality genome assembly and annotation for the Hemipteran superfamily Cicadoidea. With a scaffold N50 of 518.9 Mb and a complete BUSCO score of 96.7%, we are confident that this assembly will serve as a vital resource toward uncovering the genomic basis of periodical cicadas’ long, synchronized life cycles and will provide a robust framework for further investigations into these insects.


Introduction
Periodical cicadas are plant-sucking bugs in the order Hemiptera-the genus Magicicada contains 7 species, which are distributed widely across the eastern United States.They range from Nebraska to Texas at the eastern edge of the Great Plains, to the Atlantic coast from Massachusetts to Georgia (Simon 1988).The genus can be divided into 3 morphologically distinct species groups: Decim, Decula, and Cassini.These 3 species groups each contain one 17-yr species and one or two 13-yr species.During their nymphal stages, cicadas feed underground on the xylem fluid of tree roots, and after 13 or 17 yr, emerge in large year classes called "broods," which are composed of multiple different Magicicada species (Williams and Simon 1995;Simon et al. 2022).These emergences have a major ecological impact-for example, many avian species rely heavily on adult cicadas to feed their young and even experience population growth corresponding with a brood emergence (Koenig and Leibhold 2005;Pons 2020).Periodical cicadas are one of the quintessential examples of the evolutionary strategy of predator satiation, which is likely one of the selective pressures that influenced the development of such unique life cycles (Karban 1982;Williams et al. 1993;Koenig and Leibhold 2013).Magicicada septendecula is one of three species belonging to the Great Eastern Brood, or Brood X, which last emerged in 2021 and will reappear in the next generation in 2038 (Kritsky 2021).
Several mysteries surround the evolution of these fascinating insects, including the origins of broods, the mechanisms by which they can track the passage of time, the well-documented large emergences of a portion of a population exactly 4 yr early or late and their incredibly lengthy life cycles.For most of these questions, in-depth investigations have been hindered by the lack of quality genomic resources (Berlocher 2013;White and Pirro 2021;Simon et al. 2022).Although separated into multiple different broods, the members of each species group have experienced gene flow, presumably during coemergences made more frequent by the 4-yr early and late individuals (Fujisawa et al. 2018).Despite this gene flow, life cycles have maintained their integrity.Previous studies have used mRNA sequencing to investigate gene flow and construct molecular phylogenies and have found that between 13-and 17-year species pairs, genomic divergence is minimal except in the Decim group where one of the species, Magicicada tredecim, is reproductively isolated from the other two, Magicicada septendecim and Magicicada neotredecim (Fujisawa et al 2018).
Little is known about the demographic history and mechanism(s) of speciation within Magicicada.Several hypotheses have been proposed to explain multiple speciation events that have occurred in the history of these insects, most notably population fragmentation due to glacial events (Fujisawa et al 2018;Sota et al. 2013).Because periodical cicadas have been shown to time their emergence by measuring cumulative soil temperature (Heath 1968), glacial events in the last million years could provide a convenient explanation and impetus for speciation events by inducing early or late emergences that could lead to reproductively isolated populations (Cox and Carlton 1988).Scattered early and late bloomers, so to speak, have been observed appearing 1 or 2 yr before or after broods and large proportions of populations have been observed to emerge 4 yr early or late (Cooley et al. 2018;Simon et al. 2022).
While there is strong evidence that cicadas time their crawl to the surface based on accumulated soil temperature (Heath 1968), there are only unsupported hypotheses to explain how they can track the passage of years.One of the most promising hypotheses involves biological pathways that are triggered by changes in the chemical makeup of xylem fluid as the seasons change (Lloyd and Dybas 1966;Heath 1968).Annotated genome assemblies for Magicicada species will allow the discovery of genes that may be involved in these hypothesized pathways and could also allow for the comparison of genome sequences with other periodical species (Helioväära et al. 1994) or early/ late bloomers to find gene variants that could explain the divergence in behavior.As this is one of nature's most sophisticated biological clocks, the results will be fascinating and provide insights into the mechanisms whereby other species track the passage of time.Recent research into the genetic control of periodicity in long-lived bamboo (Zhang et al. 2021) and temperature-based RNA expression and editing (Birk et al. 2023) could also provide fascinating hypotheses of the molecular mechanisms for timekeeping in this genus.
Further highlighting the importance of this genomic resource is the ancient estimated divergence time of the superfamilies Cercopoidea and Cicadoidea (Johnson et al. 2018).This divergence represents 250 million years of evolution, which are unrepresented (Fig. 1C) in the current genomic data.Here, we fill this gap by sequencing, assembling, and annotating a chromosome-length genome for the periodical cicada, M. septendecula, commonly known as the "little 17-year cicada."

HiFi Assembly and Scaffolding with Hi-C Data
We sequenced a single individual male from the periodical cicada species, M. septendecula.Upon assembly, we found that the genome length is quite large compared to most other insects as measured by flow cytometry data (Hanrahan and Johnston 2011).At 6,521,820,903 bp, this genome assembly is almost 2.5× the size of the next largest Hemipteran assembly currently on NCBI (Biello et al. 2020).Repetitive elements were found to make up a large percentage of the genome, with a repeat content of 72.78% (35.64% classified) and a GC content of 35.25% as identified by RepeatMasker (v.4.1.2) (Flynn et al. 2020).Analysis with BlobTools (v.1.1.1)(Laetsch and Blaxter 2017) revealed several sequences that were categorized into Chordata as well as viral sequences (see supplementary fig.S1, Supplementary Material online)however, due to their considerable length and a match in GC content with the rest of the genome, we argue that it is likely that these sequences were misclassified due to a lack of database coverage, similar to findings in the genome assembly of the dragonfly Tanypteryx hageni (Tolman et al. 2023).S6).C) Phylogenetic tree illustrating the relationships between the included Hemipteran superfamilies (see Table 1) with relative branch lengths illustrating divergence dates estimated by Johnson et al. (2018).The dotted line subtending Cicadoidea and Cercopoidea indicates uncertainty in superfamily relationships as discussed by Skinner et al. (2020), Cao and Dietrich (2022), and Song and Zhang (2023).
Hi-C sequencing revealed 9 autosomes with 1 X chromosome (Fig. 1B; supplementary table S1, Supplementary Material online) in an XX/X0 sex-determining system, consistent with prior karyotyping of the genus Magicicada (Karagyan et al. 2020).PacBio HiFi sequencing and assembly followed by Hi-C scaffolding resulted in a highly improved genome assembly, with a scaffold N50 of 518.9 Mb, an L50 of 4, and 2,030 total scaffolds.Hi-C scaffolding thus improved our initial assembly into a strong genomic resource as potential pitfalls due to high repeat content (73%) and size (6.5 Gb) are addressed by the long-read sequencing technology as well as scaffolding with Hi-C data.Using NCBI's FCS tool, we trimmed and removed any contigs flagged as contaminants, ending with 10 chromosomelength scaffolds (95.49% of the assembly length) and 2,005 unplaced scaffolds (4.51% of the assembly length).We used tidk (v.0.2.41) (Brown et al. 2023) to search for telomeric repeats in the assembly and found telomeres on the ends of each chromosome-length scaffold (see supplementary fig.S2, Supplementary Material online).The complete BUSCO score for the Hi-C-scaffolded assembly was 96.7%.

Annotation
Five Illumina, paired-end Magicicada RNA-Seq libraries were trimmed and aligned to the M. septendecula genome (supplementary table S2, Supplementary Material online).Prior to quality control, total reads ranged between 22,101,494 and 85,009,400.Following quality control, that range fell between 22,092,126 and 85,000,848.RNA alignment rates were variable, falling between 81.29% and 94.83%.
The EASEL pipeline, which leverages generalized hidden Markov models and random forests to both predict and refine gene models, produced an unfiltered and filtered structural annotation.The unfiltered prediction, derived from multiple levels of RNA and protein support, captured 140,729 genes and 303,929 transcripts.This duplication was intentionally high to maximize gene sensitivity for downstream filtering.The mono:multiexonic ratio of 2.07 was indicative of fragmentation; however, despite an inflated number of false positives, 99.7% (S: 20.8%, D: 78.9%) of Insecta single-copy orthologs were captured.Following primary and secondary feature filtering using the invertebrate training set, EASEL predicted 22,785 genes and 83,621 transcripts with a mono:multiexonic ratio of 0.200 and a BUSCO completeness score of 96.4% (S: 24.9%, D: 71.5%).The functional annotation generated by EnTAP produced 57,260 unique RefSeq similarity search alignments (68.5%).With the addition of EggNOG gene family assignment, 81,978 sequences out of 83,621 were uniquely annotated (98.0%) (supplementary table S3, Supplementary Material online).The primary gene model (longest isoform) resolved the BUSCO duplication rate for the annotation but at the expense of completeness dropping to 91.2% (S: 89.2%, D: 2.0%), the mono:multiexonic ratio increasing to 0.209, and RefSeq alignments dropping to 59.4% (supplementary table S4, Supplementary Material online).This is slightly lower than the assembly BUSCO reported previously but still within a range that indicates a high-quality annotation.

Specimen Collection
Samples were collected and flash frozen on dry ice by Chris Simon and Stephen Chiswell in Knox County, TN and Wilkes Co., NC in May of 2021, during the Brood X emergence.After sequencing, the samples were stored as vouchers in the Bean Life Science Museum at Brigham Young University.For the complete sample metadata, please see Table S5.

Library Prep and Sequencing
We extracted DNA from a single Wilkes Co., NC male using the Qiagen GenomicTip high molecular weight DNA extraction kit.We then sheared the DNA to 18-kb fragments using a Diagenode Megaruptor and size-selected fragments of >10 kb using a SAGE Science BluePippin.We then generated a HiFi sequencing library using the PacBio SMRTbell Express Template Prep Kito 2.0.We sequenced the library across four 30-h SMRT cells on the PacBio Sequel II instrument at the BYU DNA sequencing center.Hi-C libraries were prepared and sequenced on an Illumina NextSeq by DNAZoo at Baylor College of Medicine using methods described in earlier publications (Rao et al. 2014;Dudchenko et al. 2017;Lamb et al. 2021;Tolman et al. 2023).A Knox Co., TN male was used for Hi-C library preparation.

Table 1
Hemipteran genome assembly statistics Chromosome-Level Genome Assembly and Annotation Physics Frontiers Center Award (NSF PHY-2019745) and by the National Human Genome Research Institute of the National Institutes of Health (RM1HG011016-01A1).DNA Zoo also acknowledges support from Illumina, IBM, and the Pawsey Supercomputing Centre.Additionally, C.W. was supported by NSF DBI 194337, and C.S. acknowledges support from NSF DEB 16-55891.Finally, this work was also supported by a BYU Life Sciences College Undergraduate Research Award.