Chromosome-level genome assemblies of the malaria vectors Anopheles coluzzii and Anopheles arabiensis

Abstract Background Anopheles coluzzii and Anopheles arabiensis belong to the Anopheles gambiae complex and are among the major malaria vectors in sub-Saharan Africa. However, chromosome-level reference genome assemblies are still lacking for these medically important mosquito species. Findings In this study, we produced de novo chromosome-level genome assemblies for A. coluzzii and A. arabiensis using the long-read Oxford Nanopore sequencing technology and the Hi-C scaffolding approach. We obtained 273.4 and 256.8 Mb of the total assemblies for A. coluzzii and A. arabiensis, respectively. Each assembly consists of 3 chromosome-scale scaffolds (X, 2, 3), complete mitochondrion, and unordered contigs identified as autosomal pericentromeric DNA, X pericentromeric DNA, and Y sequences. Comparison of these assemblies with the existing assemblies for these species demonstrated that we obtained improved reference-quality genomes. The new assemblies allowed us to identify genomic coordinates for the breakpoint regions of fixed and polymorphic chromosomal inversions in A. coluzzii and A. arabiensis. Conclusion The new chromosome-level assemblies will facilitate functional and population genomic studies in A. coluzzii and A. arabiensis. The presented assembly pipeline will accelerate progress toward creating high-quality genome references for other disease vectors.

1. Although the authors have already tried multiple assembly tools for their ONT sequences, I can still observe several obvious mis-assemblies from the Hi-C heat maps. I would recommend two assembly tools that are designed for the ONT long noisy reads, which, as far I know, can perform better than those tools applied in the current work.
https://github.com/xiaochuanle/NECAT https://github.com/Nextomics/NextDenovo However, I am also aware that the performance of bioinformatics tools varies a lot owing to the variety of genomes inherited from biodiversity. Therefore, the authors do not have to apply a full genome analysis pipeline for the two software in case they cannot produce better assembly results compared to the ones you have. Response: Thank you for your assessment and suggestions. We believe that by trying multiple assembly tools our pipeline produced the best assemblies possible using the sequencing data for these mosquito species. As for the observed several "misassemblies" from the Hi-C heat maps, we are not completely sure what regions are in question. If they are located within the chromosomal arms, these are inner parts of the contigs assembled by CANU. The off-diagonal signals on the A. coluzzii Hi-C map are caused by the normalization and visualization algorithm of JBAT. We cannot precisely determine the real cause of these transchromosomal 2R-3R signals. They can be either misassemblies by CANU or technical problems with the Hi-C reads alignment in repeat regions. In any case, these signals occupy very small regions on 2R. If they are located in the debris region, then they indeed can be mis-assemblies but they are not part of the chromosome-level assembly. The debris region consists of haplotigs, assembly artifacts, and small genome contigs that have no Hi-C signal to be assigned to any scaffold. Most of these signals are caused by haplotigs. To address this issue, we provided a new figure 2 with lines separating chromosome arms. This new figure was obtained after removing debris regions and unlocalized scaffolds (X pericentromeric, autosomal pericentromeric, and Y unlocalized). These regions lack the Hi-C signal and, therefore, they were scaffolded with additional methods during validation (CQ analysis, satellite mapping, mapping of A. gambiae genes).
2. The authors may want to put some of the additional files, especially those figures and small tables, into one single additional file and name them as supplementary figures and tables to improve readability. Response: We have decided to keep the current format since it is easier to go to a specific supplementary figure or table just by clicking on the hyperlink in the text.
3. The authors may want to depict their results more carefully. For example, (1) the median read length was 3.8 kbp and 2.3 kbp for An. coluzzii and An. arabiensis, respectively, according to additional file 1, rather than 4 kbp and 2.2 kbp described in your main text; (2) Although CANU generated the third highest number of misassemblies, it also has a longer alignment length compared to the assemblies produced by WTBG2 et al. The long alignments could have contributed to those more mis-assemblies, which needs additional explanations; (3) I cannot achieve the conclusion of single copy genomic regions of 204.1 Mbp from additional file 5 (Page 7); and some others I won't point them out one by one here. Response: (1) The median read length in the text was corrected according to additional file 1.
(2) In our main text, we focused on two metrics (NG50 and the number of misassemblies) since they are good representative measures that can be understood by a broad audience. We sought a trade-off between explaining our decisions for a broader audience and correctness of our explanations. However, we provided full QUAST reports after each stage of our pipeline for experts in genome assembly. So, they would be able to derive their own conclusions considering more parameters than just the two mentioned above. We agree with the reviewer that the relationship between the number of misassemblies and alignment length exists and should be considered before deciding with which assembly to proceed further. While some assemblers tend to produce longer contigs (i.e., trying harder to resolve some complex regions), other assemblers take a more conservative approach and produce shorter contigs. Considering that scaffolding methods with Hi-C allow not only joining contigs but also correcting misassemblies, it is not 100% clear based on which metric an assembly should be chosen. Moreover, while it is natural to assume that longer alignment lengths tend to lead to a larger number of misassemblies, it may be not so obvious in our case since the reference genome is different from the assembled genomes. Thus, some "misassemblies" are not misassemblies at all. Overall, we made our decisions based on all metrics in the QUAST report as well as some manual investigations of obtained assemblies. We believe that a lot of nuances about these metrics represent interest only for experts in the genome assembly community and may be omitted for representation purposes. We think that the NG50 metric has some weak connection with alignment length and can be used as a representative one.
(3) To estimate the genome size and size of single-copy genomic regions we used methodology described in the genome size estimation tutorial [1]. The figure in additional file 5 shows distributions of 19-mers from Illumina reads for each species. We separated a 19-mer distribution into three consecutive ranges that correspond to error sequences, single copy sequences (i.e., haploid peak), and repeat sequences. We estimated the average 19-mer single copy coverage by finding maximum in haploid peak. After that, we calculated the area under the curve for the whole distribution range except sequencing error range. For obtaining genome length, the obtained area was divided by the coverage calculated in the previous step. The length of the single copy sequences was calculated in the same manner but only for the single copy sequence range. The formulas for calculating respective lengths are given in the Methods. It is important to mention one key assumption that was made about our data for analyses described above. Illumina data for both species were obtained by sequencing colonies of mosquitoes (i.e., mix of different individual genomes). Therefore, we expect a high heterozygosity rate.
We will be glad to answer how we derived other lengths if any questions remain.
[1] https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/ 4. End-to-end genome assembly, do you mean Telomere-to-Telomere genome assembly. Response: We have changed to "telomere-to-telomere." Additional Information: coluzzii and An. gambiae are fertile [9,10]. These species are highly anthropophilic and endophilic; they are often sympatric but differ in geographical range [11], larval ecology [12], mating behavior, [13], and strategies for surviving the dry season [14]. Genomic analyses have the power to infer how these different adaptations are determined and maintained. A recent study described a new taxon, designated Anopheles TENGRELA, that genetically is most similar to An.
coluzzii [15]. Still undiscovered and misidentified cryptic taxa could seriously confound ongoing genomic studies of Anopheles ecology and evolution of insecticide resistance.
The quality of genome annotation and analyses of any organism highly depends on the completeness of the assembly [16,17]. Draft genome assemblies of species with highlypolymorphic genomes, such as mosquitoes, may have many gene annotation problems: genes can be missing entirely, have missing exons or gaps, or be split between scaffolds. As a consequence, it is difficult to estimate the total gene number or gene copy number, both of which may be linked to important phenotypic traits. Genes of particular interest with respect to vectorial capacity are especially prone to such errors since they often belong to gene families: aquaporins, ionotropic, odorant, and gustatory receptors, immunity genes, insecticide resistance genes, and reproduction gene clusters [18][19][20][21][22][23][24][25]. A genome with missing information can also cause problems for correct analyses of transcriptome, epigenome, and population genomic data. An. gambiae, because of its epidemiological importance, was the first disease vector that had its genome sequenced by the Sanger method (2002 and updated in 2007) [26,27]. Since then, the AgamP4 assembly remains the standard chromosome-level genome reference for species of the An.
gambiae complex [28][29][30][31][32]. Using the An. gambiae genome as a reference for functional annotation and population genomic analyses in other species comes at the expense of losing important information on species-specific genetic architectures in the sequencing data. Moreover, the AgamP4 assembly has misassembled haplotype scaffolds, large gaps, incorrect orientation of 5 some scaffolds, and unmapped sequences [26,27]. The 16 Anopheles mosquito species genome project included several members of the An. gambiae complex [33]. Among them, the genome of Illumina-based chromosome-level assembly for this species. Also, a de novo genome assembly from a single An. coluzzii Ngousso mosquito was obtained using PacBio sequencing [36]. The high-quality AcolN1 assembly for the Ngousso strain was placed into chromosome context by ordering and orienting the PacBio contigs to the AgamP4 reference. Also, 40% of the unmapped sequences in AgamP4 were assigned to the appropriate chromosomal positions [36].
Although each current sequencing or scaffolding technology alone cannot provide a telomere-to-telomere genome assembly and each method has its limitations [37], a combination of complementary approaches can lead to a chromosome-scale assembly [38,39]. The ongoing revolution in sequencing and scaffolding methods urges researchers to undertake efforts to create new genome references that satisfy the modern standards. The Oxford Nanopore sequencing technology is a single molecule, real-time sequencing approach that utilizes biological membranes with extremely small holes (nanopores) and electrophoresis to measure the change in ionic current when a DNA or RNA molecule passes through the membrane [40]. This highthroughput technology generates exceptionally long reads and has been proven successful in genomic studies for a wide range of biological samples such as arboviruses [41], bacteria [42], plants [43], insects [44], and humans [45]. Hi-C is a groundbreaking technology that exploits in vivo chromatin proximity information to yield dramatically improved genome assemblies. In contrast to alternative scaffolding approaches, such as phasmid libraries or other mate-paired sequencing methods [33], the Hi-C method can produce chromosome-level genomic scaffolds [39,[46][47][48][49][50].
In this study, we tested the pipeline for obtaining superior-quality genome assemblies for malaria mosquitoes using the following steps: (i) high-coverage Oxford Nanopore sequencing and assembly using high-molecular-weight genomic DNA from inbred individuals, (ii) gap-filling and error-correction using Illumina sequencing data, (iii) Hi-C scaffolding of Oxford Nanopore contigs to chromosomes, and (iv) evaluation and validation of completeness and contiguity of the assemblies. We developed new reference genome assemblies for An. coluzzii and An. arabiensis, which will facilitate studies for a deeper understanding of the biology and genetics of these major African malaria vectors. The presented assembly pipeline will accelerate progress toward creating high-quality genome references for other disease vectors.

Data Description
Here, we describe the pipeline for obtaining de novo chromosome-level assemblies for An. coluzzii MOPTI (AcolMOP1, NCBI:txid1518534) and An. arabiensis DONGOLA (AaraD3, NCBI:txid7173). To produce the highly contiguous assemblies, we adopted and modified a strategy that was recently used for these goals in fruit flies [51,52]. Briefly, we sequenced the genomes using the Oxford Nanopore technology, assembled contigs from long Nanopore reads, coluzzii and An. arabiensis, respectively. We aligned Nanopore reads from An. coluzzii and An.
arabiensis to the genome of the closely-related species An. gambiae (AgamP4) using minimap2 [53]. For An. coluzzii, the total number of aligned and unaligned reads was 3.3 M (99%) and 0.03 and Canu v1.8 [58]. In the case of the Canu v1.8 assembler, we obtained two assemblies: one consisting of unitigs (i.e., unambiguous reconstructions of the sequence) and the other consisting of contigs. We evaluated the contiguity of the draft assemblies using QUAST-LG [59] and estimated the genome sizes for An. coluzzii and An. arabiensis by a k-mer analysis of Illumina reads using Jellyfish [60]. For the An. coluzzii genome, the peak of the 19-mer distribution was at a depth of 54, and the genome size was estimated as 301. We obtained the following five assemblies for An. coluzzii: gambiae AgamP4 reference genome [26,27]), the Canu contig assembly showed better contiguity (Fig. 1A, Additional file 6). However, this assembly also featured the third-largest number of misassemblies. The Wtdbg2 assembly had the smallest number of misassemblies.
We obtained the following five assemblies for An. arabiensis: • 1,920 contigs of a total length of 298.2 Mbp produced by Wtdbg2; • 1,280 contigs of a total length of 289.7 Mbp produced by FLYE; • 687 contigs of a total length of 338.8 Mbp produced by Miniasm; • 521 unitigs of a total length of 277.2 Mbp produced by Canu; • 211 contigs of a total length of 298.5 Mbp produced by Canu.
The total length of all the assemblies except Miniasm is underestimated compared with the An.
arabiensis genome size estimated from Illumina reads (315.6 Mbp). The An. arabiensis NG50 values from any assembler were substantially larger than those of An. coluzzii (Fig. 1B
Among these tools, only GRAAL, SALSA2, and 3D DNA code repositories are actively maintained. We chose 3D-DNA and SALSA2 for scaffolding of our draft assemblies because 3D-DNA allows manual correction while SALSA2 can use the assembly graphs produced by assemblers. Since SALSA2 was designed primarily for unitigs, we assessed the performance of both tools on the An. arabiensis unitig and contig assemblies produced by Canu. We ran SALSA2 with the corresponding assembly graph for the two An. arabiensis assemblies. Our experiments showed that 3D-DNA has a tendency of aggressively splitting contigs as the number of rounds grows. We, therefore, chose to run 3D-DNA with only one round instead of the default three rounds.
The QUAST-LG computation of metrics for the original and processed An. arabiensis assemblies demonstrated that SALSA2 performs better than 3D-DNA (Table 2). Strikingly, the assemblies processed by 3D-DNA had a higher number of scaffolds and lower NG50 than in the original assembly. At the same time, 3D-DNA corrected about 700 misassemblies in the contig assembly. While SALSA2 did not show a substantial boost in contiguity for unitig assembly, it significantly improved contig assembly at the cost of introducing a small number of misassemblies (Additional file 11). We visually inspected the initial Hi-C contact heat maps produced by the scaffolding of the An. arabiensis assemblies (Additional file 12). The best heat map was generated by SALSA2 on the An. arabiensis contig assembly. For example, SALSA2 reconstructed the correct order of the scaffolds for the X chromosome except that one inversion was required to correct the orientation (upper left corner in Additional file 12a, b). On the other hand, the 3D-DNA heat maps were smoother (Additional file 12c, d) as the 3D-DNA software tends to remove repetitive sequences present in the original assemblies. We conclude that SALSA2 is better suited for scaffolding of contigs obtained from long reads. However, the results from both tools can be further improved by manual correction of scaffolds based on visual inspection of the Hi-C heat maps. Juicebox Assembly Tools (JBAT) v1.11 is the only tool currently available for manual correction of genome assemblies [74]. While 3D-DNA is designed to be loadable into JBAT for manual correction, we were unable to convert SALSA2 output to a data format that could be loaded and corrected in JBAT. Therefore, despite SALSA2 producing better scaffolding results, we decided to proceed with the 3D-DNA scaffolds obtained from the Canu contig genome assemblies for An. arabiensis (Additional file 12c) and An. coluzzii (Additional file 13). Both species assemblies required manual correction by reordering, changing orientation, splitting contig sequences, and allocating scaffold borders. The main goal of such manual correction was to obtain chromosome-level scaffolds without assembly errors, haplotype sequences, and assembly artifacts. We also tried to minimize the number of contig splits by the manual correction.
To improve our manual correction process, we used the following additional information about contigs and scaffolds in the assemblies. All contigs were classified with PurgeHaplotigs software [75] into primary contigs, haplotigs, and assembly artifacts based on the read-depth analysis as follows. Read-depth histograms were produced for the An. coluzzii and An. arabiensis draft assemblies (Additional file 14). In each read-depth histogram, we chose three cutoffs to capture two peaks of the bimodal distribution that correspond to haploid and diploid levels of coverage. The first read-depth peak resulted from the duplicated regions and corresponded to a "haploid" level of coverage. The second read-depth peak resulted from regions that are haplotypefused and corresponded to the "diploid" level of coverage. We also aligned contigs from each draft assembly to the An. gambiae PEST (AgamP4) assembly to obtain information about distribution of the contigs across the chromosomes.
Since the Hi-C signal must be stronger for adjacent sequence regions, we manually reordered and changed the orientation of contigs in each assembly to keep the Hi-C signal strong along the diagonal. We used the PurgeHaplotigs classification and the fact that haplotig sequences lead to parallel diagonal signals for moving these contigs into debris. We also moved the contigs with low Hi-C signal and the contigs classified as assembly artifacts to debris. The remaining contigs were reordered according to the Hi-C signal. After manual correction we obtained the final Hi-C contact heat maps for the chromosome-level genome assemblies of An.
Contigs in debris were further partitioned into several scaffolds. We performed chromosome quotient analysis (CQ) [76] to tentatively assign contigs in debris to the Y chromosome (CQ<0.1), X chromosome (CQ=2), and autosomes (CQ=1). We grouped contigs with CQ values less than 0.1 into the "chrY" scaffolds. In addition, we removed sequences from the chrY scaffolds if they were classified as autosomal (Ag53C, Ag93) or X chromosomal 16 (AgX367) in the previous studies [77][78][79]. These results showed that the An. coluzzii and An.
arabiensis assemblies contain 126 and 4 contigs from chromosome Y, respectively. We were unable to arrange contigs inside the chrY scaffolds into a correct order or orientation since the Hi-C signal was weak for these contigs.
For better understanding of the contig distribution among chromosomes, we also aligned known tandem repeats from the pericentromeric regions of chromosome X and autosomes to our assemblies. We retrieved the sequences from the debris contigs that belong to the pericentromeric region of chromosome X based on tandem repeats 18S rDNA and AgX367 and to autosomal pericentromeric regions based on satellites Ag53C and Ag93 [77][78][79]. Since the Hi-C signal is low for these contigs due to low complexity of the corresponding genomic regions, we were unable to determine their position inside the scaffolds forming chromosomes. Therefore, we grouped these contigs into separate "X_pericentromeric" and "Autosomal_pericentromeric" scaffolds.    Table 4). The relatively smaller number of genes mapped to the X chromosome of An. arabiensis agrees with its higher divergence from the An. gambiae X chromosome [7]. The gene alignments provide supporting evidence that the obtained assemblies have the correct gene content.  , we generated three whole-genome pairwise alignments for the following pairs of assemblies: AcolMOP1 and AgamP4, AaraD3 and AgamP4, AcolMOP1 and AaraD3 (Fig. 3, Additional file 16). We observed that no alignment pair had inter-chromosomal rearrangements between the assemblies. Gaps in whole-genome pairwise alignments between AcolMOP1 and AgamP4 and between AaraD3 and AgamP4 indicate that all chromosomes in the new assemblies have more genomic information in the pericentromeric regions than the AgamP4 assembly.
Overall, the pairwise alignments show high concordance of the AcolMOP1 and AaraD3 assemblies with the existing AgamP4 assembly. Finally, we assessed the presence of known tandem repeats in pericentromeric regions of chromosomes 2, 3, X, and in chromosome Y of the AcolMOP1, AaraD3, AgamP4, and AcolN1 [36] assemblies. In particular, we used the presence of the putative pericentromeric tandem repeat Ag93 [78,79] to assess the completeness of autosomal arms. We observed 27,364 and 33,460 Ag93 repeat copies in An. coluzzii and An. arabiensis assemblies, respectively, which are substantially greater than the 4,446 and 2,188 Ag93 repeat copies found in AgamP4 and AcolN1, respectively (Table 5). Moreover, while all hits in AgamP4 were located in the chromosome UNKN, 409/7003 and 1643/735 repeats were found in scaffolds corresponding to chromosomes 2/3 in the An. coluzzii and An. arabiensis assemblies, respectively (Additional file 19). This result indicates that the AaraD3 and AcolMOP1 chromosomes are more complete than the AgamP4 chromosomes. We also analyzed the presence of the Ag53C tandem repeat and its junction with the Tsessebe III transposable element, which are known to be located in the pericentromeric regions of the autosomes [78,83]. The An. coluzzii assembly contains the highest number (646) of these repeats and the An. arabiensis assembly contains a comparable number of the repeats to AgamP4, while the AcolN1 assembly contains just 41 repeat copies (Table 5). It should be noted that these repeats are not located in scaffolds assembled to autosomal chromosomes and can only be found in the Autosomal_pericentromeric scaffolds of the An. coluzzii and An.
arabiensis assemblies or in the chromosome UNKN of the AgamP4 assembly. Overall, the AaraD3 and AcolMOP1 assemblies contain more assembled sequences from the autosomal pericentromeric regions than the AgamP4 and AcolN1 assemblies.
For assessing completeness of the pericentromeric regions in the X chromosome and within the Y chromosome, we used AgX367 and AgY477 repeat sequences that are known to be located in the X and Y chromosomes, respectively [77,78]. It is important to note that AgX367 and AgY477 repeat sequences share a region of high similarity. AcolMOP1 and AaraD3 contain a much higher number of AgX367 copies than AgamP4 or AcolN1 do (Table 5). All these copies are found in the X_pericentromeric scaffold only (Additional file 19). While AcolMOP1 contains 7685 of AgY477 repeat copies in the scaffold that corresponds to the Y chromosome, they are absent in AaraD3. We also used AgY53B repeat and zanzibar retrotransposon to assess for the 22 presence of Y chromosome contigs in the assemblies. Similar to AgY477, these sequences are found in AcolMOP1 but absent in AaraD3. AcolMOP1 contains the highest number of AgY53B and zanzibar copies among all the studied assemblies. For validating scaffolds corresponding to the X chromosome, we further used the Ag113 and 18S rDNA sequences. Remarkably, only AcolMOP1 and AgamP4 assemblies contain full-length 18S rDNA sequences. The Ag113 sequence is widely present in AcolMOP1 and AgamP4, but absent in AaraD3 (Table 5).
Importanly, AcolMOP1 contains 1941 Ag113 copies in the X chromosome while all appearances of Ag113 in AgamP4 are located in the chromosome UNKN (Additional file 19). All these results indicate that the X and Y chromosomes are better assembled in AcolMOP1 than in AgamP4 or AcolN1. We conclude that we produced chromosome-level genome assemblies for An. coluzzii and An. arabiensis. The AaraD3 and AcolMOP1 assemblies have 98.1% of conserved singlecopy Diptera genes from BUSCO and contain pericentromeric heterochromatin sequences as well as sequences of the Y chromosomes and the rDNA cluster. Our assessments show that the AaraD3 and AcolMOP1 assemblies are of higher quality and more continuous than AgamP4 or AcolN1.

Genome rearrangements in An. coluzzii and An. arabiensis
The 2La. Inversions Xag are fixed in An. coluzzii and An. gambiae [86] and inversions Xbcd are fixed in An. arabiensis [86]. Inversion 2La is fixed in An. arabiensis but polymorphic in An. coluzzii and An. gambiae [86]. Our previous studies identified the X chromosome inversion breakpoints by aligning the An. gambiae PEST assembly with the Illumina-based An. arabiensis assemblies arabiensis, An. coluzzii, and An. gambiae [86], and its breakpoint regions are shared among the three species indicating a single common origin of the inversion [87].  To validate the new 2R microinversion in An. arabiensis, we aligned our AaraD3 assembly with AaraD2 [35], which is a super-scaffolded, Illumina-based AaraD1 assembly [33]. We found the new 2R microinversion in the AaraD2 assembly as well, confirming its presence the DONGOLA strain (Additional file 24). We also aligned AcolMOP1 with the reference-guided scaffolded AcolN1 assembly [36]  In addition to the rearrangements identified using these three species assemblies, we  [36]. Since our AcolMOP1 assembly extends farther into the heterochromatin than the AcolN1 assembly does, we were able to detect two times more rearrangements in the X chromosome. These misalignments could be due to order and/or orientation errors in the PEST genome assembly. However, they may also represent novel rearrangements segregating between An. gambiae and An. coluzzii. We recently described a new type of shared cytogenetic polymorphism in the incipient species, An. gambiae and An. coluzzii-an inversion of the satDNA location in relation to the proximal gene-free X chromosome band [83]. The findings suggest that structural variations can be common in the sex-chromosome heterochromatin of mosquitoes.

27
By combining long-read sequences generated by the Oxford Nanopore technology and long-range information produced by the Hi-C approach, we obtained high-quality reference genome assemblies for An. arabiensis and An. coluzzii. We demonstrated tremendous improvement in the completeness and contiguity of these species' genomes. Thus, these assemblies provide a valuable resource for comparative genomics, epigenetics, functional analyses, and population studies of malaria mosquitoes. To maximize the use of the data, tools, and workflows of this study, we present a pipeline for obtaining superior-quality genome assemblies for malaria mosquitoes based on Hi-C scaffolding of Oxford Nanopore sequencing contigs (Fig. 5). The pipeline illustrates successful approaches along with other approaches that we tried but discarded in the course of its development.

Mosquito sample collection for Oxford Nanopore sequencing
A single well-blood fed female mosquito was separated from the original cage. After oviposition, F1 progeny from this single female were inbred with each other for 4 days in a 46 oz paper popcorn cup. F1 females were given bloodmeal and, after 72 hours, eggs were collected from a single F1 female. The F2 progeny were reared under normal conditions and were inbred as before. After six rounds of inbreeding using this procedure, all F7 pupal progeny from a single F6 female were sorted by sex, collected, flash frozen in liquid nitrogen, and stored at -80°C.

Genomic DNA isolation
Genomic DNA was isolated from 20 inbred male pupae following a modified Qiagen  PRJNA615337, SRA: SRS6448831) were obtained from the previous study [90].

Quality control of Nanopore reads
Analysis and visualization of the long Nanopore reads were performed with the Nanostat and Nanoplot tools of the Nanopack software (de2018nanopack) [91]. Alignment of Nanopore reads to the genome of An. gambiae (AgamP4) was done using minimap2 (Minimap2, RRID:SCR_018550) [53]. A contamination analysis was performed with Kraken2 [54] using a custom database with addition of the An. gambiae [26,27] and An. coluzzii Ngousso [36] genomes.

Nanopore sequence assembly
Genome assemblies from the Nanopore sequencing data were obtained using wtdbg2  [61,62] and QUAST-LG [59]. For QUAST-LG, the An. gambiae (AgamP4) genome was used as a reference. An. gambiae is evolutionary more closely related to An. coluzzii (0.061 million years divergence) than to An. arabiensis (0.509 million years divergence) [8] and, thus, the reference-based metrics (such as NG50 and the number of misassemblies discussed below) was considered with caution. Using BUSCO, each assembly was queried for 2,799 conserved single-copy diptera genes, as well as for 978 conserved single-copy Metazoa genes.
A gene recognizing model was trained for the Agustus tool [92] in the BUSCO pipeline by using the Aedes aegypti genome [48].

Genome size estimation and polishing the genome assemblies
For genome size estimation and polishing assemblies obtained from Nanopore reads, Illumina short paired-end data were used. The NCBI SRX accession numbers were SRX3832577 for An.
coluzzii and SRX084275, SRX111457, SRX200218 for An. arabiensis. Quality control of the Illumina reads was performed with FastQC (FastQC, RRID:SCR_014583) [93]. Based on the FastQC analysis, reads were filtered by the quality and minimum read length, and TruSeq adapters were trimmed from reads using fastp v0.20.0 (fastp, RRID:SCR_016962) . The genome sizes of An. coluzzii and An. arabiensis and sizes of single-copy genomic regions were estimated by the k-mer analysis for k=19 based on the Illumina short pair-end reads using methodology described in the genome size estimation tutorial [94]. A 19-mer distribution was separated into three consecutive ranges that correspond to error sequences, single copy sequences (i.e., haploid peak), and repeat sequences. The average 19-mer single copy coverage was estimated by finding maximum in haploid peak. After that, the area under the curve was calculated for the whole distribution range except sequencing error range. For obtaining genome length, the obtained area was divided by the coverage calculated in the previous step. The length of the single copy sequences was calculated in the same manner but only for the single copy sequence range. The formulas for calculating respective lengths are the following: , where Lg is a genome length, Lsc is a single-copy sequence length, Cov is an average 19-mer single copy sequence coverage, freq is 19-mer frequency, K(freq) is the number of distinct 19mers with frequency equal freq (Y-axis), End -maximal frequency value, and Lb, Rb are borders of the range for haploid peak. For example, the length of single copy region for An.coluzzii genome is calculated as follows: The frequency distribution of 19-mers in all high-quality short reads was computed by Jellyfish (Jellyfish, RRID:SCR_005491) [60]. Racon (Racon, RRID:SCR_017642) [63], Medaka [64], and Nanopolish (Nanopolish, RRID:SCR_016157) [65] were used to correct nucleotide substitutions, insertions, and deletions. Pilon (Pilon, RRID:SCR_014731) [66] was run several times using Illumina reads.
Metrics for the original and processed assemblies were computed by QUAST-LG [59]. Visual inspection of the Hi-C contact heat maps was performed. All contigs in the assemblies were classified with PurgeHaplotigs software [75] into primary contigs, haplotigs, and assembly artifacts based on the read-depth analysis. Alignment of contigs to the An. gambiae PEST (AgamP4) assembly was used for obtaining information about distribution of the contigs across the chromosomes.

Chromosome quotient analysis
Since AgamP4 assembly does not contain chromosome Y, the chromosome quotient (CQ) analysis was performed using Illumina reads from female and male mosquito genomes to detect the presence of contigs from the Y chromosome. According to the original definition [76], for a given sequence Si, CQ(Si) = F(Si) / M(Si), where F(Si) is the number of alignments from female sequence data to Si, and M(Si) is the number of alignments from male sequence data to Si.
Therefore, the CQ method allows for the differentiation of Y sequences from autosome and X sequences. CQ calculation was performed at a 1 kb window for contigs or scaffolds. If the number of male reads was below 20, the CQ value of that particular 1 kb window was not used. Contigs or scaffolds with at least 15% of the 1 kb windows showing CQ values less than 0.1 were considered as Y-derived and were grouped into a separate scaffold called "chrY."

Gene mapping and pairwise alignments
The An. coluzzii MOPTI (AcolMOP1) and An. arabiensis Dongola (AaraD3) assemblies were validated by comparing them with the existing assembly An. gambiae PEST (AgamP4), representing the most complete chromosome-level anopheline genome assembly known to date.