A Draft Genome Assembly of Culex pipiens pallens (Diptera: Culicidae) Using PacBio Sequencing

Abstract The Northern house mosquito, Culex pipiens pallens, serves as important temperate vectors of several diseases, particularly the epidemic encephalitis and lymphatic filariasis. Reference genome of the Cx. pipiens pallens is helpful to understand its genomic basis underlying the complexity of mosquito biology. Using 142 Gb (∼250×) of the PacBio long reads, we assembled a draft genome of 567.56 Mb. The assembly includes 1,714 contigs with a N50 length of 0.84 Mb and a Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness of 95.6% (n = 1,367). We masked 60.63% (344.11 Mb) of the genome as repetitive elements and identified 2,032 noncoding RNAs. A total of 18,122 protein-coding genes captured a 94.1% of BUSCO gene set. Gene family evolution and function enrichment analyses revealed that significantly expanded gene families mainly involved in immunity, gustatory and olfactory chemosensation, and DNA replication/repair.


Introduction
Mosquitoes are important vectors that can transmit a variety of infectious diseases and more than half of the world's population is threatened by mosquito-borne diseases, causing a huge burden to human health and the economy (Gething et al. 2011(Gething et al. , 2012Bhatt et al. 2013;Acevedo et al. 2015). Species of Culex pipiens complex (Diptera: Culicidae) are globally distributed and has been considered as major vectors of several diseases. Members of this complex including Culex quinquefasciatus, Cx. pipiens pallens, Culex pipiens pipiens, Culex pipiens molestus, Culex australicus, and Culex globocoxitus (Smith and Fonseca 2004;Harbach 2012;Russell 2012;Aardema et al. 2020). Culex pipiens pallens is the most widely distributed subspecies of Cx. pipiens complex in northern China and the primary vector of lymphatic filariasis, epidemic encephalitis (Fonseca et al. 2009;Turell 2012;Cano et al. 2014).
High-quality mosquito genomes are the important genetic resources for the studies of vector-borne biology and evolutionary of bloodsucking characters. To date, 30 mosquitoes (Culicidae) genomes have been public (NCBI, accessed November 25, 2020), including 27 Anopheles, 2 Aedes, and 1 Culex species. The assembly sizes (ca. 150-300 Mb) of Anopheles genomes are rather smaller than other two genera (>500 Mb in Culex and >1 Gb in Aedes). The only available Cx. quinquefasciatus genome has an assembly size of 579.04 (539.96 ungapped) Mb, 48,671 contigs and 3,171 scaffolds, and a contig/scaffold N50 length of 28.55/486.76 kb (table 1, Arensburger et al. 2010). Here, we assembled a de novo genome assembly of Cx. pipiens pallens using the Pacific Bioscience (PacBio) single-molecule real-time (SMRT) platform. We annotated the protein-coding genes, as well as repetitive elements and non-coding RNAs (ncRNAs). Gene family evolution across the main Diptera clades was analyzed, particularly focusing on those rapidly evolving families.

Sample Collection and Sequencing
The Cx. pipiens pallens strain used for sequencing was originally collected from Mengtougou (China, Beijing) in 1999, and has been maintained in the laboratory without exposure to any insecticides. Female adults emerging from pupae without feeding were prepared for sequencing: 100 for Illumina and PacBio whole genome and 50 for transcriptome, respectively. We extracted genomic DNA using the Qiagen Blood and CELL Culture DNA mini Kit, constructed a library of 350 bp insert size using the TruSeq DNA PCR-Free LT Library Preparation Kit and a library of a 40 kb-insert size using a SMRTbell DNA Template Prep Kit 2.0. Genomic RNA was extracted using TRIzol Reagent and library was constructed using the TruSeq RNA v2 Kit. Short-read libraries were subject to the paired-end 150 bp (PE 150) sequencing on the HiSeq NovaSeq 6000 platform. Long-read library was sequenced on the PacBio Sequel II system using the Sequel Sequencing kit v2.1 chemistry. All libraries were sequenced at Berry Genomics (Beijing, China).
Preliminary genome assembly and long-read polishing were performed using Flye v2.7 (Kolmogorov et al. 2019) with a minimum overlap between reads of 5,000, 50Â longest reads for an initial contig assembly and one round of selfpolishing ('-m 5000 -asm-coverage 50 -i 1'). Redundant heterozygous contigs were removed using three rounds of Purge_Dups v1.0.0 (Guan et al. 2020) based on read depth with a minimum alignment score of 50 and a minimum chaining score of 5,000 for a match ('-a 50 -l 5000'). Resulting nonredundant assembly was polished with Illumina short reads using two rounds of NextPolish v1.1.0 (Hu et al. 2020). Minimap2 v2.12 (Li 2018) was used as sequence aligner for above redundancy removal and short-read polishing. We detected potential contaminant sequences using HS-BLASTN (high-speed blastn, Chen et al. 2015) against the NCBI nucleotide (nt) and UniVec databases. To assess the assembly quality, we assessed genome completeness using Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 pipeline (Waterhouse et al. 2018) against insect reference gene set (n ¼ 1,367), and calculated the mapping rate by aligning PacBio long reads and Illumina short reads to the final genome assembly with Minimap2.
We estimated expansions and contractions of gene families using CAF E v4.2.1 (Han et al. 2013) with the approach of single birth-death parameter lambda and the significance level of 0.01. Function enrichment analyses of GO and KEGG categories were also performed for those significantly expanded families using R package clusterProfiler v3.10.1 (Yu et al. 2012) with the default significance values (P-value <0.01 and q-value <0.05).

Genome Assembly
Altogether, 140.47 Gb ($246Â) Illumina short reads and 142.74 Gb ($250Â) PacBio subreads and 6.59 Gb transcriptome data were generated. The mean and N50 length of the long PacBio subreads are 22,438.45 kb and 34.59 kb, respectively. After quality control, 124.07 Gb ($218Â) short reads were kept for the subsequent genome polishing.
Raw PacBio reads were assembled into 10,627 contigs by Flye and the obtained 1.7 Gb length of sequences almost reached three times size of Cx. quinquefasciatus genome. Although completed BUSCOs occupied 99.1%, the BUSCO completeness assessment (n ¼ 1,367) identified 1,270 (92.9%) reference genes as complete and duplicated BUSCOs. This result indicated that Flye assembly possessed a very high ratio of redundancy. After redundancy removal, polishing and contaminant detection, the final Cx. pipiens pallens assembly had a length of 567.56 Mb, which comprising 1,714 contigs. The contig N50 length was 839.22 kb, GC content was 36.76%, and BUSCO completeness was 95.6% (5% complete and duplicated, 0.4% fragmented, 4% missing). The assembly size and low ratio of redundancy (5%) suggested that most heterogeneous contigs had been successfully removed from the assembly of Cx. pipiens pallens. Overall, the assembly size of Cx. pipiens pallens was closer to that of Cx. quinquefasciatus, as well as an estimate of 540 Mb from reassociation kinetics (Rao and Rai 2008). However, compared with the Cx. pipiens pallens assembly in this study, the genome of congeneric Cx. quinquefasciatus had much lower contig contiguity and a high level of gaps (table 1). Genome alignment at a 0.1% sequence divergence (-asm5) using Minimap2 showed that 99.99% CPP assembly regions could be mapped to the genome of Cx. quinquefasciatus.
We predicted 18,122 protein-coding gene models using MAKER pipeline. Compare with Cx. quinquefasciatus, gene predictions of Cx. pipiens pallens had longer mean lengths of genes, exons and introns (table 1), indicating that high-quality gene prediction in this study. BUSCO completeness assessment identified 94.1% complete genes (n ¼ 1,367) using protein mode "-m prot. Thirty-six single-copy genes were removed by "symtest" in IQ-TREE and remaining 404 genes were used to phylogenetic inference. Phylogenetic reconstruction based on 155,458 amino acid sites was fully resolved with 100/100 node supports. Topology, that is, classification was also consistent with previous studies (Wiegmann et al. 2011). Culicomorpha (Culicoidea þ Chironomoidea) was sister to other dipteran species. Dating analyses revealed that stem of Culicidae originated from late . It almost emerged simultaneously with mammals and dinosaurs, which may provide opportunities of evolution of new feeding habits (i.e., sucking blood) for mosquitoes. Two Culex species diverged from late Neogene (2.72-3.54 Ma) ( fig. 1a).

Gene Family Evolution
Gene family evolution estimated using CAF E upon phylogenetic tree is shown in figure 1a. For Cx. pipiens pallens, there were 1,306 and 1,079 gene families experienced expansions and contractions, respectively. Among all of the changed genes, 180 genes (125 expansions and 55 contractions) were rapidly evolving gene families. Most significant expanded families were related to immunity, gustatory and olfactory chemosensation, and DNA replication/repair. Immunity-related families included chitin binding Peritrophin-A domain-containing protein (Terra 2001), Fibrinogen C domain-containing protein (von Huth et al. 2018, C-type lectin (Brown et al. 2018), helicase MOV (Balinsky et al. 2017), and Peptidoglycan recognition protein (Royet et al. 2011). Origin recognition complex subunit 5 (ORC5), mitochondrial DNA polymerase (containing anticodon binding domain) and Geminin genes involve in DNA replication and repair (Lee et al. 2009;Copeland 2010;Tang et al. 2017;Coulombe et al. 2019). Large expansions of immuneand DNA replication/repair-related genes explained the possible mechanism of adaptations to polluted, harsh environment  1..--Phylogeny, orthologs, and gene family evolution. (a) Phylogeny, dating and gene family evolution. Node values representing the number of expanded, contracted and rapidly evolving families, respectively. (b) Statistics of orthologs and paralogs. "1:1:1" represents shared single-copy genes, "N:N:N" as multi-copy genes shared by all species, "Culicidae" as orthologs unique to Culicidae, "Others" as unclassified orthologs, "Unassigned" as orthologs which cannot be assigned into any orthogroups. (c) Top twenty significantly expanded families with gene numbers of the families shown above the bars. (d) and (e) Function enrichment of GO (d) and KEGG (e) for significantly expanded gene families. Only the top twenty categories are shown. for mosquitoes, particularly their larvae. In Cx. quinquefasciatus, Arensburger et al. (2010) discovered the expansions of cytosolic glutathione transferases and cytochrome P450s adaptable to evasion of insecticides, but not the case in Cx. pipiens pallens. Mosquito chemosensation are crucial for host seeking, foraging, mating, and oviposition (Clements 1992), the expansions of gustatory and olfactory receptors may reflect olfactory behavioral diversity of the Cx. pipiens pallens in host and oviposition site choice (Kwon et al. 2006;Sparks et al. 2018). Further enrichment analyses of GO ( fig. 1d) and KEGG (fig. 1e) for those significant expanded gene families also reinforced above results, such as GO categories: chemosensation-related (odorant binding, olfactory receptor activity, sensory perception of smell), DNA replication-related (DNA replication, DNA-dependent DNA replication, DNA replication origin binding, DNA replication initiation, chromosome condensation). This is the first genome assembly for Cx. pipiens pallens. Considering the importance of Cx. pipiens pallens as a vector of several human pathogens, we hope insights from the genome resource will be helpful for advance the understanding of biological characters of this species and contribute to ongoing efforts to develop control measures of mosquitoes and mosquito-borne diseases.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.