The genome sequence of the scarce swallowtail, Iphiclides podalirius

Abstract The scarce swallowtail, Iphiclides podalirius (Linnaeus, 1758), is a species of butterfly in the family Papilionidae. Here, we present a chromosome-level genome assembly for Iphiclides podalirius as well as gene and transposable element annotations. We investigate how the density of genomic features differs between the 30 Iphiclides podalirius chromosomes. We find that shorter chromosomes have higher heterozygosity at four-fold-degenerate sites and a greater density of transposable elements. While the first result is an expected consequence of differences in recombination rate, the second suggests a counter-intuitive relationship between recombination and transposable element evolution. This high-quality genome assembly, the first for any species in the tribe Leptocircini, will be a valuable resource for population genomics in the genus Iphiclides and comparative genomics more generally.


Introduction
The scarce swallowtail, Iphiclides podalirius (Linnaeus, 1758), is a widespread butterfly species in the family Papilionidae. The species is common in open habitats in the Palearctic, ranging from France to Western China, but is absent from Northern areas (e.g. Scandinavia and the British Isles) and some Mediterranean Islands (e.g. Sardinia, where only occasional records exist). I. podalirius is generally bivoltine, the larvae feed mainly on different species of Prunus, principally P. spinosa, and overwinter in the pupal stage.
The genus Iphiclides belongs to the tribe Leptocircini (kite swallowtails), which diverged from other taxa in the subfamily Papilioninae about 55 million years ago (Allio, Scornavacca et al. 2020), and only includes two other species: I. podalirinus, which is restricted to Central Asia and Iphiclides feisthamelii, the sister taxon of I. podalirius, which is found in Northern Africa and the Iberian Peninsula. A controversy about the taxonomic status of I. feisthamelii, which has been regarded as a subspecies of I. podalirius (Godart and Duponchel 1832;Tolman and Lewington 2009;Wiemers and Gottsberger 2010), has only recently been resolved; although the two species have no known differences in ecology or life history and share mitochondrial haplotypes (Dinc a et al. 2015), Gaunet et al. (2019) show that they differ consistently in wing patterns (including UV reflectance of males), genital morphology and nuclear DNA and are separated by a narrow hybrid zone in Southern France (Descimon and Mallet 2009).
Although Allio, Scornavacca et al. (2020) have previously generated Illumina shotgun data for I. podalirius, the assemblies based on these data (Allio, Scornavacca et al. 2020;Ellis et al. 2021) are highly fragmented (with an N50 of 0.6 and 1.7 kb, respectively). More generally, while chromosome-level assemblies are available for several swallowtail butterflies in the genus Papilio (Lu et al. 2019), similarly contiguous genome assemblies are lacking for other tribes within Papilioninae.
Here, we present a chromosome-level genome assembly for I. podalirius, as well as gene and transposable element (TE) annotations. We use this assembly to investigate how heterozygosity in the reference individual varies both between genomic partitions and chromosomes.

Sampling
Two male individuals (MO_IP_500 and MO_IP_504) were collected with a hand net at Le Moulin de Bertrand, Saint-Martin-de-Londres, Montpellier, France, and flash frozen in liquid nitrogen. High-molecular-weight (HMW) DNA was extracted from the thorax of one of these individuals (MO_IP_504) using a salting out extraction protocol as described in Mackintosh et al. (2022).

Sequencing
A SMRTbell sequencing library was generated from the HMW extraction of MO_IP_504 by the Exeter Sequencing Service. This was sequenced on 3 SMRT cells on a Sequel I instrument to generate 24.0 Gb of Pacbio continuous long-read (CLR) data.
A chromium 10Â library was prepared from the HMW extraction at the Cancer Research UK Cambridge Institute, UK. This library was sequenced by Edinburgh Genomics (EG) on a single HiSeqX lane, generating 25.3 Gb of paired-end reads after processing with Long Ranger v2.2.2 (Marks et al. 2019). However, the weighted mean molecule size of these data (12.86 kb) limited its use for scaffolding of the Pacbio assembly and the reads were therefore only used for polishing. Hereafter, these data are simply referred to as Illumina WGS.
In addition, tissue from MO_IP_500 was used for chromatin conformation capture (HiC) sequencing. The HiC reaction was performed using an Arima HiC kit, following the manufacturer's instructions for flash-frozen animal tissue. An NEBNext Ultra II library, prepared by EG, was sequenced on an Illumina MiSeq, generating 4.7 Gb of paired-end reads.
Paired-end RNA-seq data were generated for individual MO_IP_504. To obtain tissue for RNA extraction, the adult individual was divided bilaterally (including all parts of the body: head, thorax, and abdomen). For further details on RNA extraction, see Ebdon et al. (2021).
The trimmed HiC reads were aligned to the contig-level assembly with Juicer v1.6 (Durand et al. 2016). Scaffolding was performed with 3d-dna v180922 using default parameters (Dudchenko et al. 2017). The scaffolded assembly and HiC map generated by 3d-dna was visualized and manually curated in Juicebox v1.11.08 (Robinson et al. 2018). A total of 7 contigs had their orientation changed through manual curation.
Following gene annotation, 5 0 and 3 0 gene flanks were defined as those that were 20 kb upstream and downstream of genes. We define regions as intergenic space if they are neither genic (start/ stop codons, exons, and introns) nor gene flanks. Bedtools intersect v2.27.1 (Quinlan and Hall 2010) was used to determine overlap (-wao) between TEs and genomic features. Following this, quantification and plotting were performed in R, using the tidyverse package (Wickham et al. 2019; RStudio Team 2020; R Core Team 2021).

Estimating heterozygosity
Heterozygosity was estimated within different partitions of the genome by first mapping the Illumina WGS reads to the assembly with bwa-mem, marking duplicated alignments with sambamba 0.8.1 (Tarasov et al. 2015), and calling variants with freebayes v1.3.2-dirty (Garrison and Marth 2012). Variants were normalized with bcftools v1.8 (Danecek et al. 2021), decomposed with vcfallelicprimitives (Garrison et al. 2021), filtered (QUAL>¼ 1), and subset to biallelic SNPs with bcftools. Callable sites were delimited with mosdepth v0.3.2 (Pedersen and Quinlan 2018), by applying a coverage threshold between 8 and 95 (inclusive). Callable sites were further restricted by removing all sites where there were indels or SNPs with QUAL < 1. Four-fold-degenerate (4D) and zero-fold-degenerate (0D) sites were identified using partition_cds.py v0.2 (see Data availability), based on the CDS BED file (where the 4th column contains transcript ID), the genome FASTA, and the VCF file. Bedtools v2.30.0 was used to intersect callable regions of the genome with intergenic, intronic, exonic, 4D, and 0D sites. Heterozygosity-the number of heterozygous bi-allelic SNPs divided by the number of callable sites-and callable sites of each genomic partition are listed in Table 1.

Genome assembly
We sequenced the genome of a male individual ( Fig. 1. a and Supplementary Fig. 1), suggesting that the flow cytometry may have produced an underestimate and that the Illumina-based assembly contains collapsed repetitive regions. We scaffolded the contigs with Arima HiC data (11.0x coverage) generated from a different male individual collected at the same locality (Fig. 1, c and d). Scaffolding generated 30 chromosome-scale sequences, which together account for 99.5% of the total assembly length and range from 6.8 to 21.1 Mb in size ( Supplementary Fig. 2). The assembly has a contig and scaffold N50 of 5.2 and 15.1 Mb, respectively. The BUSCO analysis shows that the assembly contains the majority of expected single-copy orthologs with little duplication (S: 96.5%, D: 0.2%, F: 0.4%, M: 2.9%). The Phred quality of the consensus sequence, estimated using solid kmers in the short-read data, is 35.8.
We assembled a circularized mitochondrial genome of 15,396 bases, containing 13 protein-coding genes, 22 tRNA genes, and 2 rRNA genes. The sequence can be aligned colinearly with the mitochondrial genome of Graphium doson (Kong et al. 2019), another species in the tribe Leptocircini, demonstrating that these mitochondrial genomes have not undergone any rearrangements.

Genome annotation
TEs compromise 32.81% of the genome assembly (Supplementary Table 1 and Fig. 2a). The assembly contains representation from all major TE types (Supplementary Table 1): the most abundant TEs are long-interspersed nuclear elements (LINEs), which constitute 11.01% of the assembly and 33.56% of total TE sequence. Recent activity is high in LINEs and there is also evidence for a very recent increase in LTR element abundance (Fig. 2b).
Considering all TE classifications, most TEs (70.14%) are found in intergenic regions. As expected, TEs are largely absent from exons with only 0.07% of exonic sequence consisting of TEs, likely due to the deleterious effects of TE insertions in exons (Sultana et al. 2017;Bourque et al. 2018). In contrast, a substantial fraction of intronic sequence (31.47%) is comprised of TEs (Fig. 2c). The most abundant TEs in the assembly, LINEs, comprise 14.25% of intergenic space, 12.14% of gene flanks, 8.98% of intronic regions, and 0.69% of exonic regions (Fig. 2c).
We annotated the assembly with 17,826 protein-coding genes, coding for 20,222 transcripts (1.13 transcripts per gene). At least 1 Pfam domain was identified along proteins of 9,363 genes (52.5%) Fig. 1. Fore and hind wings of the two I. podalirius individuals used to generate the genome sequence. a) Dorsal and b) ventral surface view of wings of specimen MO_IP_504, used to generate Pacbio long-read, Illumina WGS short-read, and Illumina RNA-seq short-read data. c) Dorsal and d) ventral surface view of wings of specimen MO_IP_500, used to generate HiC data. and start codons were found in genes coding for 20,163 proteins (99.71%). The BUSCO score of the protein sequences (S: 66.9%, D: 12.8%, F: 5.5%, M: 14.8%) is lower than the score of the genome sequence (see above) with an expected increase in duplicated BUSCOs.
The median length of genes is 4.0 kb, with the majority (51.7%) consisting of 4 exons or fewer. The number of gene predictions is higher than in genome annotations for species in the genus Papilio, such as Papilio dardanus (12,795, Timmermans et al. 2020) and Papilio bianor (15,375, Lu et al. 2019), but far lower than in the annotation of the Parnassius apollo genome (28,344, Podsiadlowski et al. 2021).
The density of annotated genomic features differs across chromosomes ( Fig. 3 and Supplementary Table 2). For example, the proportion of sequence made up of TEs ranges from 24.4% on chromosome 7 to 39.4% on chromosome 29. Similarly, exon density also ranges approximately two-fold across chromosomes, from 3.3% on chromosome 25 to 6.4% on chromosome 30. The density of TEs is negatively correlated with chromosome length (Spearman's rank correlation, q ð28Þ ¼ À0:520, P ¼ 0.004), whereas exon density and chromosome length are uncorrelated (Fig. 3).

Genome-wide heterozygosity
Across the genome assembly, we identified 2,514,242 heterozygous SNPs in the reference individual MO_IP_504, which is equivalent to a per-base heterozygosity (across all sites) of 0.00598 (Table 1). As expected, given selective constraint, heterozygosity in exons is less than half of that in introns and intergenic regions (Table 1). Likewise, within coding sequence heterozygosity is highest for 4D sites and lowest for 0D sites (Table 1), which is  expected given that the latter are under strong evolutionary constraint (Sawyer et al. 1987).
We note that our estimate of 4D site heterozygosity is comparable, but slightly higher, than p 4D estimates previously reported for I. podalirius based on transcriptome assemblies and data from two individuals (0.0052, 0.0057) (Mackintosh et al. 2019;Ebdon et al. 2021). This difference most likely reflects the fact that transcriptome assemblies are biased toward highly expressed transcripts, which experience greater indirect effects of purifying selection (Marek and Tomala 2018).

Conclusions
We describe a chromosome-level genome assembly for the scarce swallowtail butterfly I. podalirius, with gene and repeat annotations that are similar to previously published Papilio butterfly genomes. By contrast, the number of gene predictions and TEs in the P. apollo genome assembly is far greater (Podsiadlowski et al. 2021), suggesting gene and repeat expansions in the subfamily Parnassiinae.
When comparing heterozygosity in the reference individual both between chromosomes and between genomic partitions, we recover two well-documented patterns of genome-wide sequence variation, which result from the direct and indirect effects of selection, respectively: (1) stark differences in genetic diversity between genomic partitions reflecting differences in selective constraint and (2) a negative correlation between chromosome length and heterozygosity at putatively neutral 4D sites. The latter has been described for several species (including butterflies Martin et al. 2016) and is an expected consequence of the fact that the indirect effect of selection on nearby neutral sites depends on the rate of recombination (which tends to be greater for short chromosomes, Haenel et al. 2018).
We also find a negative relationship between chromosome length and TE density. This is surprising given that increased recombination rates on shorter chromosomes are expected to result in more efficient selection against TE insertions (Langley et al. 1988). Despite this, the observation that smaller chromosomes contain a greater density of repetitive elements has also been reported in Heliconius and Melitaea butterflies (Ahola et al. 2014;Cicconardi et al. 2021), suggesting that this may be a general feature of Lepidopteran genomes.
The I. podalirius genome will be valuable resource-not only for genomic analyses that investigate the forces driving genome evolution in the long term-but will also allow for detailed studies of the population-level processes that lead to the accumulation of barriers between species still experiencing gene flow.

Data availability
The genome assembly, gene annotation, and raw sequence data can be found at the European Nucleotide Archive under project accession PRJEB51340. The script used for calculating the site degeneracy (partition_cds.py) and the script used for visualizing HiC contacts (HiC_view.py) can be found at the following github repository: https://github.com/A-J-F-Mackintosh/Mackintosh_et_ al_2022_Ipod. The mitochondrial genome sequence and the TE annotation can be found at the same repository.
Supplemental material is available at G3 online.