A chromosome-level genome assembly of Neotoxoptera formosana (Takahashi, 1921) (Hemiptera: Aphididae)

Abstract Neotoxoptera formosana (Takahashi), the onion aphid, is an oligophagous pest that mainly feeds on plants from the Allium genus. It sucks nutrients from the plants and indirectly acts as a vector for plant viruses. This aphid causes severe economic losses to Allium tuberosum agriculture in China. To better understand the host plant specificity of N. formosana on Allium plants and provide essential information for the control of this pest, we generated the entire genome using Pacific Biosciences long-read sequencing and Hi-C data. Six chromosomes were assembled to give a final size of 372.470 Mb, with an N50 scaffold of 66.911 Mb. The final draft genome assembly, from 192 Gb of raw data, was approximately 371.791 Mb in size, with an N50 contig of 24.99 Kb and an N50 scaffold of 2.637 Mb. The average GC content was 30.96%. We identified 73 Mb (31.22%) of repetitive sequences, 14,175 protein-coding genes, and 719 noncoding RNAs. The phylogenetic analysis showed that N. formosana and Pentalonia nigronervosa are sister groups. We found significantly expanded gene families that were involved in the THAP domain, the DDE superfamily endonuclease, zinc finger, immunity (ankyrin repeats), digestive enzyme (serine carboxypeptidase) and chemosensory receptor. This genome assembly could provide a solid foundation for future studies on the host specificity of N. formosana and pesticide-resistant aphid management.


Introduction
Allium crop species are used worldwide as vegetables and play an important role in daily diets in Asia (Shahrajabian et al. 2020). These Allium vegetables contain various sulfur and organic compounds that exhibit anticancer activities and may be useful for the treatment and prevention of cancers (Asemani et al. 2019). During our previous investigation, we found that Neotoxoptera formosana (Takahashi) (Hemiptera: Aphididae) causes significant economic losses to Allium plant agriculture, particularly A. tuberosum. N. formosana, also known as the onion aphid, damages Allium plants by sucking the cell sap from the plants, spreading many plant viruses and defecating sticky honeydew (Wang et al. 2021;Lin et al. 2022).
Onion aphids can search for Allium plants but show a significant response to the repellent effects of volatile sulfides, which are released by the plants (Hori 2007), and substances such as rosemary (Hori and Komatsu 1997). This aphid survives all year round but there are 2 main hazard peaks in the year: March-May and July-September, in Guizhou province, China. Previous studies have found that the predatory gall midge, Aphidoletes aphidimyza, has a good control efficiency against N. formosana under laboratory conditions (Wang et al. 2021). The complete mitochondrial genome of N. formosana was sequenced and annotated by Song et al. (2021). Despite its highly specialized host range and significant economic losses to Allium plant agriculture, no genome information for N. formosana has been published. In this study, we provided a chromosome-level genome assembly of N. formosana. This is the first genome assembly of this genus and will provide important basic information for the study of aphid taxonomy, host plant specificity, and pesticide-resistant aphid management.

Sample collection and sequencing
The onion aphids used for sequencing were obtained from Puding County, Anshun City, Guizhou Province, China (105˚27 0 49 00 E, 262 6 0 36 00 N), in December 2020, and were reared in the laboratory at the Institute of Entomology, Guizhou University (Fig. 1). There were 90, 30, and 60 adult females used for PacBio sequencing, RNA-Seq analysis, and Hi-C sequencing, respectively. High-quality DNA was extracted using the QIAGEN DNeasy Blood and Tissue kit. For PacBio sequencing, a 20-kb insert-size library was constructed using the SMRTbell Template Prep Kit 2.0 and sequencing was performed on the PacBio Sequencer Sequel II. For the Illumina sequencing, the Truseq DNA PCR-free kit was used to construct PCR-free libraries with an insert size of 350 bp. For the RNA-Seq analysis, the TRIzol Reagent kit was used to extract RNA and libraries were constructed using the TruSeq RNA v2 kit. The Hi-C library construction was performed by Berry Genomics and included cross-linking, restriction enzyme (MboI) digestion, fragment end repair, DNA cyclization, and DNA purification. All the Illumina libraries were sequenced on a NovaSeq 6000, to achieve reads of 150 bp in length.

Genome assembly
The Illumina genomic datasets were assessed for quality and trimmed using BBTools v38.82 (Bushnell 2014), using the following steps: (1) the duplicated sequences were removed using clumpify.sh; (2) bbduk.sh was used to remove low quality bases (<Q20), sequences shorter than 15 bases and poly-A/G/C tails (longer than 10 bases). It was also used to correct bases from overlapping reads (qtrim ¼ rl, trimq ¼ 20, minlen ¼ 15, ecco ¼ t, maxns ¼ 5, trimpolya ¼ 10, trimpolyg ¼ 10, trimpolyc ¼ 10). The k-mer analysis was performed using Genomescope v2.0 (Vurture et al. 2017), with the maximum k-mer coverage of 10,000. The kmer frequency was calculated using the khist.sh script from BBTools (kmer length: 21). The PacBio raw long reads that passed quality control were assembled using wtdbg2 v2.5 (Ruan and Li 2020), with the parameters of "-X 300 -p 15 -k 0 -S 4 -e 2." Polishing of the assembly was performed using NextPolish v1.3.1 (Hu et al. 2020), with 1 round of long-read polishing and 2 rounds of short-read polishing. For short-read polishing, the reads were first mapped to the assembly using mini-map2 v2.22 (Li 2018), with default parameters, and the produced ".sam" files were converted to ".bam" using samtools v1.10 (Li et al. 2009). The haplotypic duplications were removed from the assembly using Purge_dups v1.2.5 (Guan et al. 2020), with "-a 70." To assign contigs to chromosomes, Juicer v1.6.2 (Durand et al. 2016) was first used to align the high-quality Hi-C reads to the assembly and the contigs were then scaffolded using 3D-DNA v180922 (Dudchenko et al. 2017), with default parameters. The generated prepseudochromosomes were manually corrected using Juicerbox v1.11.08 (Durand et al. 2016), based on the Hi-C contact maps, and the files were then imported into 3D-DNA to produce the final chromosomal assembly. The contaminated sequences were assessed and removed, using MMseqs2 v12-113e3 (Steinegger and Sö ding 2017), by blasting contigs against the nt and UniVec databases. The cleaned assembly was also uploaded to NCBI for an additional search for possible contamination. The assembly completeness was assessed using BUSCO v3.0.2 (Waterhouse et al. 2018), with searches against "insect_odb10." To assess the coverage of raw data, the reads were mapped to the assembly using Minimap2 v2.22. To investigate genome collinearity with Acyrthosiphon pisum and Rhopalosiphum maidis, MMseq2 v12-113e3 was first used to align protein sequences with "blastp," with parameters of "s 7.5 -alignmentmode 3 -num-iterations 4 -e 1e-5 -max-accept 5." The resulting files, together with the annotation file (all.gff), were then used as inputs to MCScanX for collinearity analysis and were visualized using TBtools v1.0692 (Chen et al. 2020).

Genome annotation
We annotated repeats, protein-coding genes, and noncoding RNAs from the assembly. To identify repeats, RepeatModeler v2.0.2a (Flynn et al. 2020) was first used to generate a de novo repeat library with the parameter of "-LTRStruct." This repeat library was then merged with the sequences from the RepBase-20181026 (Bao et al. 2015) database to form a more extensive repeat library, which was used as the input into RepeatMasker v4.1.0 (Smit et al. 2013(Smit et al. -2015. The protein-coding gene models were predicted using MAKER v3.01.03 (Holt and Yandell 2011), by integrating the predictions from 3 strategies. EVidenceModeler (EVM) was used for evidence weighting. The 3 strategies were: (1) ab initio prediction: BRAKER v2.1.6 (Hoff et al. 2016) was used to

Significance
Onion aphids cause significant economic losses to Allium plant agriculture, particularly A. tuberosum. However, there is very little knowledge of this aphid, in terms of genetics. To expand the genetic resource of this pest, we assembled the whole genome and found the expanding gene families of N. formosana. This will provide opportunities for future studies on N. formosana genetics and will eventually contribute to aphid control.
Gene functional annotation was conducted in 3 steps: (1) gene models were searched against the UniProtKB (SwissProtþTrEMBL) and nr databases. To search against UniProtKB, the sensitive mode (-very-sensitive -e 1e-5) was used for Diamond v2.0.11.149 (Buchfink et al. 2015) to obtain functional description; (2) gene models were searched against the Pfam, Smart, Superfamily, and CDD databases using InterProScan 5.48-83.0 (Quevillon et al. 2005)  To annotate noncoding RNAs, infernal v1.1.4 was used to annotate rRNA, snRNA, and miRNA by searching against the Rfam database. The tRNAs were annotated using tRNAscan-SE v2.0.9 (Chan and Lowe 2019) and the predicted tRNAs with low fidelity were removed with the "EukHighConfidenceFilter" script.
The prediction of gene family expansion and contraction within N. formosana, when compared with the other 15 species, was conducted using CAFÉ v4.2.1, with the model of single birth-death parameter lambda and a significance level of 0.01 (P ¼ 0.01). The significantly expanded/contracted gene families were then assigned to GO and KEGG categories, using R package clusterProfiler v3.10.1 (Yu et al. 2012), with the default parameters (P ¼ 0.01 and q ¼ 0.05).

Genome sequencing and assembly
We obtained 104.3 Gb of PacBio long reads, with a mean read length of 15.42 kb and an N50 length of 24.99 kb. The genome was predicted to be between 395.5 and 397.2 Mb, with extremely low heterozygosity (Fig. 2). We estimated that 31.22% of the assembly contained regions of repetitive sequences (Supplementary Table 1). Our initial genome assembly, which was assembled solely from PacBio reads, was 371.791 Mb and contained 357 scaffolds and 1259 contigs (Table 1). We found that 93.9% of this assembly contained complete BUSCO genes (1,367),   Table 2). The mapping-back rates for Illumina short and PacBio long reads were 96.79% and 92.95%, respectively, which indicated that our assembly had high coverage of the raw data. There were approximately 800 Mb of Hi-C data used to assign scaffolds and contigs onto the 6 chromosomes (Figs. 3 and 4).
The predicted genes had a mean length of 6,552.6 bp, a mean CDS length of 211.1 bp and a mean number of exons of 9.3 (Table 1).

Genome collinearity analysis
We found a relatively high level of conserved linkage between N. formosana (Nf) and Acyrthosiphon pisum (Ap) genomes, when compared with Nf and Rhopalosiphum maidis (Rm) genomes  . 3. Hi-C contact map of the Neotoxoptera formosana genome.
( Fig. 5). N. formosana chromosome 1 (NfChr1) was mostly collinear with ApChrX, with some small regions aligned to ApChrA2. We found that NfChr2 and NfChr6 had homologous regions on ApChrA1. The syntenic regions of NfChr3 were located on ApChrA1 and the entire of ApChrA3. Conservation was observed between NfChr4 and 5 and ApChrA2. When compared with the R. maidis genome, only NfChr1 showed a high level of conservation with RmChr3, whilst other chromosomes had syntenic regions scattered across the R. maidis genome.

Phylogeny
We used protein sequences from 15 species, together with the annotated N. formosana protein models, to construct a phylogenetic tree (Fig. 6). There were 254,609 (91.3%) gene models assigned to 19,010 gene families. Among 4,169 gene families that were present in all species, 1,273 and 2,896 were single-and multicopy families, respectively. In N. formosana, 14,037 genes were clustered into 9,838 families and 77 genes from 30 families were found to be specific to this species (Fig. 6). A total of 555,217 amino acid residues, obtained from 1,113 single-copy genes, were used for phylogenetic construction. Most lineages had UFB/SH-aLRT supports of 100/100, apart from Macrosiphini sacchari, which had supports of 99.9/94, and Aphis gossypii and Melanaphis sacchar, which had supports of 99.3/98 (Fig. 6). This phylogeny suggested that N. formosana and P. nigronervosa were sister groups.

Gene family evolution
When compared with the other species, we found that the number of expanded and contracted gene families in N. formosana was 629 and 3,067, respectively. There were 330 gene families that  showed significant expansion or contraction. The gene families with significant expansion were THAP domain, DDE superfamily endonuclease, zinc finger, immunity (ankyrin repeats), digestive enzyme (serine carboxypeptidase), and chemosensory receptor families (Fig. 7a). The Odorant receptors (ORs) gene family exhibited rapid expansion, in accordance with the GO enrichment analysis. The results of the KEGG pathway enrichment analysis showed that pathways involved in detoxification, immunity, and secondary metabolite synthesis were significantly enriched (Fig. 7b). Aphid ORs play an essential role in the perception of different host odors or pheromones (Liu et al. 2022). In this study, 16 ORs candidate genes were rapidly exhibited expansion in N. formosana. This rapid expansion might be associated with the feeding behaviour of N. formosana, which is an oligophagic aphid pest only feeding on different Allium species. Hori (2007) finds that N. formosana might use dipropyl trisulphide (extracted from Allium fistulosu) and diallyl disulphide (extracted from Allium tuberosum) as olfactory cues to search for the host plants based on Y-tube olfactometer. In order to understand the odor perception of N. formosana, future work should analyze the expression patterns of ORs genes in different issues and identify the functional analysis of ORs genes to different plant volatiles (Zhang et al. 2019).

Data availability
Genome assembly and raw sequencing data have been deposited at the NCBI under the accessions JAIWJD000000000 and SRR18085628, SRR18079676, SRR18079766 and SRR13334673, respectively. Genome annotations are available at the Figshare under the link: https://figshare.com/articles/online_resource/A_ Chromosome-level_genome_assembly_of_Neotoxoptera_formo sana_Takahashi_Takahashi_1921_Hemiptera_Aphididae_/ 19165817. Supplemental material is available at G3 online.