A high-coverage draft genome of the mycalesine butterfly Bicyclus anynana

Abstract The mycalesine butterfly Bicyclus anynana, the “Squinting bush brown,” is a model organism in the study of lepidopteran ecology, development, and evolution. Here, we present a draft genome sequence for B. anynana to serve as a genomics resource for current and future studies of this important model species. Seven libraries with insert sizes ranging from 350 bp to 20 kb were constructed using DNA from an inbred female and sequenced using both Illumina and PacBio technology; 128 Gb of raw Illumina data was filtered to 124 Gb and assembled to a final size of 475 Mb (∼×260 assembly coverage). Contigs were scaffolded using mate-pair, transcriptome, and PacBio data into 10 800 sequences with an N50 of 638 kb (longest scaffold 5 Mb). The genome is comprised of 26% repetitive elements and encodes a total of 22 642 predicted protein-coding genes. Recovery of a BUSCO set of core metazoan genes was almost complete (98%). Overall, these metrics compare well with other recently published lepidopteran genomes. We report a high-quality draft genome sequence for Bicyclus anynana. The genome assembly and annotated gene models are available at LepBase (http://ensembl.lepbase.org/index.html).


Data Description
The squinting bush brown butterfly, Bicyclus anynana, is a member of the remarkably speciose nymphalid subtribe Mycalesina, which is distributed across the Old World tropics (Fig. 1). B. anynana is an important model organism for the study of lepidopteran ecology, development, speciation, behaviour, and evolution [1][2][3][4][5][6]. B. anynana are found primarily in woodland habitats across East Africa (from southern Sudan in the north to Swaziland in the south), and adults are typically observed flying close to the ground, where they feed on fallen fruit [1]. Strikingly, B. anynana exhibits seasonal polyphenism, a form of phenotypic plasticity whereby individuals that develop during the wet season differ in behaviour, appearance, and life history to those that develop during the dry season [7][8][9]. Wet season butterflies are smaller, have shorter lifespans, are more active, and show larger and more conspicuous eyespots on their wings in comparison to dry season individuals. The genetic basis of this plasticity and its impacts on various other life history and developmental characteristics are ongoing research questions to which the availability of a B. anynana reference genome will contribute [10][11][12].

Sampling and sequencing
Genomic DNA was extracted from a B. anynana female that had been inbred via 7 generations of brother-sister matings. The captive laboratory stock population from which these individuals originated was established in 1988 from 80 wild-caught indi- viduals and has been maintained at large effective population sizes to minimise the loss of genetic diversity [1]. Two shortinsert libraries with insert sizes of 350 and 550 bp were constructed using Illumina TruSeq Nano reagents and sequenced (125 base, paired-end) on an Illumina HiSeq2500 at Edinburgh Genomics (Edinburgh, UK). DNA from a sister to this focal animal was used to construct four long-insert (mate-pair) libraries with insert sizes of 3 and 5 kb (2 of each) at the Centre for Genomic Research, University of Liverpool (Liverpool, UK); libraries of both insert-sizes were then sequenced on an Illumina HiSeq2500 and an Illumina MiSeq at Edinburgh Genomics (Table 1). DNA from a female descendent of the same inbred line was used to construct 2 long read libraries with insert sizes of 10 and 20 kb, sequenced on the PacBio platform at the Genome Institute of Singapore at ∼×10 coverage using 16 P6 SMRT cells. All raw data have been deposited in the Short Read Archive under the accessions given in Table 1. A total of 128.2 Gb of raw Illumina data was filtered for low-quality bases and adapter contamination using Skewer v. 0.2.2 [13], and both raw and trimmed reads were inspected using FastQC v. 0.11.4 [14]. Only 4 Gb of data (3.1%) was discarded, indicating the high quality of the raw data. Kmer frequency distributions were estimated using the "kmercountexact" program from the BBMap v. 36.02 package [15] and showed 2 major coverage peaks at ∼×105 and ∼×210 (Fig. 2). The first  peak (×105) represents the proportion of the genome that is heterozygous and has an approximate span of 87.7 Mb (18.4% of the genome; calculated as one-half of the area under the ×105 curve, from ×50 to ×150). The expected proportion of heterozygous sites given 7 brother-sister (full-sib) matings is 0.75 7 = 13.3%, or 63.5 Mb. Thus, the greater than expected heterozygosity is likely to be due primarily to selection against highly inbred individuals during the course of the inbreeding regime [16].

Contaminant filtering and assembly
Short-insert libraries were screened for the presence of contaminant reads using Taxon-Annotated GC-Coverage (TAGC) plots, or "blobplots" [17]. An initial draft assembly was constructed using the CLC assembler (CLCBio, Copenhagen) and compared to the NCBI nucleotide database (nt) using Megablast v. 2.3.0+ [18], and against the UniRef90 protein database using Diamond v. 0.7.10 [19]. Read coverage for each contig was calculated by mapping both libraries to the CLC assembly using CLC mapper (CLCBio, Copenhagen), and blobplots were generated using Blobtools v. 0.9.19.4 [20] using the "bestsumorder" rule for taxonomic annotation of contigs (Fig. 3). Contigs that showed a substantially different coverage relative to that of the main cluster of contigs and/or good hits to sequences annotated as non-Arthropoda were classed as putative contaminants. A total of 237 394 pairs of reads (∼59 Mb) that were classed as either "mapped/mapped" or "mapped/unmapped" to a putative contaminant were subsequently discarded from further analysis. Filtered libraries were reassembled using the heterozygousaware assembler Platanus v. 1.2.4 [21], with default parameters. Contigs were further scaffolded with the mate pair libraries using SSPACE v. 3.0 [22] and with 35 747 assembled B. anynana transcripts [23] using a combination of L RNA scaffolder [24] and SCUBAT v. 2 [25]. A final round of scaffolding was performed with PacBio long reads (fastq files error-corrected using the RS Preaassembler.2 protocol) using SSPACE-LongRead v. 1.1 [26]. Finally, gaps between scaffolds were filled using GapFiller v. 1.10 [27] and PBJelly v. 15.8.24 [28].
Our final assembly (v. 1.2) comprised 10 800 scaffolds spanning a total of 475.4 Mb, with a scaffold N50 of 638 kb (Table 2). The genome-wide proportion of G+C was 36.5%, while the number of undetermined bases (Ns) was 5.8 Mb (∼1.2% of the total span). We determined assembly completeness by mapping both genomic and transcriptomic reads from B. anynana (SRA whole genome sequencing accessions ERR1102671-8 and transcriptome accessions ERR1022636-7, ERR1022640-1, and ERR1022644-5, downloaded October 2016) to the genome using BWA mem v. 0.7.12 [29] and STAR v. 020201 [30], respectively. Over 99% of reads from the 2 short-insert libraries mapped to the assembly, suggesting that the vast majority of the genome represented by these data has been assembled. In addition, 94.9% of RNA-Seq reads mapped to the assembly, suggesting that the majority of transcribed genes are present. Gene-level completeness was assessed using CEGMA v. 2.5 [31] and BUSCO v. 2.0 [32]. The proportion of CEGMA genes "completely" recovered (n = 248) was 81%, increasing to 97% when partially recovered genes were included. The recovery of BUSCO genes specific to the metazoa (n = 978) was higher, at 98% for complete genes, increasing to 99% when partial genes were included. An almost complete set (99.2%) of BUSCO genes specific to the Arthropoda (n = 1066) was also recovered. In addition, CEGMA indicated a duplication rate of 1.1 while BUSCO estimated only ∼2% of genes were present in multiple copies. The high complete CEGMA/BUSCO scores suggestthat a good assembly has captured the majority of core metazoan/Arthropod genes in full length and that the fragmentation of genes across multiple scaffolds is low. In addition, the low duplication rates suggest that most genes are present in single copy, and thus that the genome does not include significant duplicated segments representing alternative haplotypes.

Annotation
Prior to gene prediction, we masked the B. anynana assembly for repetitive elements to minimise the number of  a N50: the length of the contig/scaffold at which 50% of the genome span is accounted for, given a list of sequences sorted by length. b numN50: the number of sequences required to reach the N50 sequence. c CEGMA/BUSCO notation: C, proportion (%) of genes completely recovered; D, duplication rate; F, proportion (%) of genes at least partially recovered (including complete genes); n, number of queries. Note that duplication rate (D) for CEGMA is given as the average number of (complete) genes recovered, whereas for BUSCO it is the proportion of complete genes recovered multiple times. BUSCO values are based on comparisons to the Arthropoda gene set.  [33] and input to RepeatMasker v. 4.0.5 [34] to mask the assembly.
Overall, approximately one-quarter of the assembly (122.6 Mb) was masked from gene prediction (Table 3). Gene finding was performed following a 2-pass approach [35]. Initial gene models were constructed with MAKER v. 2.31 [36] using HMMs derived from SNAP [37] and GeneMark-ES v. 4.3 [38] in conjunction with a recently published B. anynana transcriptome as evidence. MAKER gene models were then passed to AU-GUSTUS v. 3.0.3 [39] for refinement, resulting in an initial set of 26 722 predicted protein-coding genes. A set of basic filters was applied to remove likely spurious gene models (Table 4), resulting in the deletion of 4080 gene models. Protein sequences from the filtered 22 642 genes were annotated using BLAST searches versus UniRef90 and the NCBI non-redundant protein database (nr), and domains/motifs were described using InterProScan5 [40]. Summary statistics for the 22 642 predicted gene models are given in Table 5.  scaffolds despite being only marginally smaller than the largest lepidopteran genome, B. mori (Fig. 4a). Interestingly, B. anynana v. 1.2 encodes the highest number of proteins of the 10 species compared (Fig. 4b). Despite measures to eliminate potentially spurious ORFs caused by annotation error or by duplication, B. anynana encodes ∼3250 more genes than the diamondback moth P. xylostella, and ∼10 400 more than the swallowtail P. polytes. It is tempting to attribute the apparently high number of genes to the developmental plasticity and alternative seasonal forms with divergent morphologies and life histories in B. anynana. However, it remains to be determined whether the number of genes predicted in B. anynana is a function of its larger genome size or unusual life history characteristics, or if further curation of the v. 1.2 gene models will reduce the number of inferred genes.