Resistance to dieldrin evolution in African malaria vectors is driven by interspecific and interkaryotypic introgression

The evolution of insecticide resistance mechanisms in natural populations of Anopheles malaria vectors is a major public health concern across Africa. Using genome sequence data, we study the evolution of resistance mutations in the resistance to dieldrin gene (Rdl), a GABA receptor targeted by several insecticides, but most notably by the long-discontinued cyclodiene, dieldrin. The two Rdl resistance mutations (296G and 296S) spread across West and Central African Anopheles via two independent hard selective sweeps that included likely compensatory nearby mutations, and were followed by a rare combination of introgression across species (from A. gambiae and A. arabiensis to A. coluzzii) and across non-concordant karyotypes of the 2La chromosomal inversion. Rdl resistance evolved in the 1950s as the first known adaptation to a large-scale insecticidebased intervention, but the evolutionary lessons from this system highlight contemporary and future dangers for management strategies designed to combat development of resistance in malaria vectors


Introduction
The recurrent evolution of insecticide resistance in the highly-variable genomes of Anopheles mosquitoes [1][2][3] is a major impediment to the ongoing efforts to control malaria vector populations. Resistance to dieldrin was the first iteration of this cyclical challenge: this organochlorine insecticide was employed in a pioneering vector control programme in Nigeria in 1954, but resistant Anopheles had already appeared after just 18 months [4] due to a single dominant mutation [5,6]. Dieldrin use ceased in the 1970s due to its high persistence as an organic pollutant and unexpectedly wide toxicity, culminating in a ban in the 2001 Stockholm Convention on Persistent Organic Pollutants. However, resistance has remained strikingly persistent in natural Anopheles populations for more than 40 years [7]. The study of the genetic architecture of dieldrin resistance can thus provide key insights into the evolutionary 'afterlife' of resistance mechanisms to legacy insecticides. We address this issue by studying its emergence and dissemination in contemporary African populations of the A. gambiae species complex.
Dieldrin resistance in Anopheles spp. is caused by mutations in its target site, the -aminobutyric γ (GABA) receptor gene -also known as resistance to dieldrin locus, or Rdl [7][8][9]. Two resistance mutations have been found in anophelines, both in Rdl codon 296 (equivalent to codon 302 in Drosophila [7]): alanine-to-glycine (A296G) and alanine-to-serine (A296S). Populations of Anopheles gambiae sensu stricto (henceforth, A. gambiae) and its sister species A. coluzzii possess both 296G and 296S alleles [7,10], whereas the 296S allele is the only one reported in A. arabiensis and the more distantly-related malaria vectors A. funestus and A. sinensis [7,11,12]. Normally, dieldrin inhibits the activity of Rdl receptors, causing persistent neuronal excitation and rapid death; but codon 296 mutations confer resistance by reducing its sensitivity to the insecticide [13]. However, in the absence of exposure, Rdl mutations appear to carry fitness costs, such as lower mosquito mating success [14] or impaired response to oviposition and predation-risk signals [15,16] (although see [17]). Consequently, with seemingly limited current benefit via exposure to insecticides targeting Rdl, persistence of the mutations in anophelines is puzzling.
We interrogate the Anopheles gambiae 1000 Genomes cohort [3,18] to ascertain how often dieldrin resistance mutations have evolved in the A. gambiae/A. coluzzii species pair, and the mechanisms by which these alleles spread across Africa and may persist. We identify two distinct Rdl resistance haplotypes in these species, defined by hard selective sweeps and the perfect linkage of the 296G and 296S alleles with putatively compensatory mutations. Furthermore, the resistance haplotypes are across genomes from different species (A. gambiae, A. coluzzii and A. arabiensis), and across chromosomes with differing karyotypes in the 2La inversion (the longest inversion in Anopheles genomes) [19] within which Rdl resides. Inter-species reproductive isolation and inversions such as 2La both result in reduced recombination rates [20][21][22][23], which would in principle hinder the spread of these adaptive alleles. Here, we provide evidence that Rdl resistance alleles, which our structural modelling shows have divergent effects on the channel pore, underwent a rare combination of interspecific and interkaryotypic introgression.
Overall, we show that two founding resistance mutations spread with remarkable ease across geographical distance, species, and recombination barriers. This evolutionary trajectory has parallels with later-emerging target site resistance mutations, such as Vgsc [24][25][26][27]. The persistence of dieldrin resistance also challenges the efficacy of current and newly developed insecticides that also target Rdl [28][29][30], as well as the efficacy of rotative insecticide management strategies [31]. These results thus emphasise the influence of past interventions on current and future programmes of vector population control.

Distribution of Rdl resistance mutations across African populations
First, we investigated the genetic variation in Rdl across populations of the Anopheles gambiae species complex, including A. gambiae and A. coluzzii from the Anopheles gambiae 1000 genomes project (Ag1000G Phase 2, n = 1142) [3], and outgroups from four other species (A. arabiensis, A. quadriannulatus, A. melas and A. merus; n = 36) [32]. All genomes and their populations of origin are listed in Supplementary Material SM1. We identified six non-synonymous mutations that are segregating in at least one population at ≥5% frequency ( Figure 1A; complete list of variants in Supplementary Material SM2), including the 296G and 296S resistance alleles. 296G is present in West and Central African populations of both A. gambiae and A. coluzzii, with frequencies ranging from 30% (Cameroon A. gambiae) to 96% (Ghana A. gambiae). 296S is present in A. coluzzii specimens from Burkina Faso (63%), as well as A.
arabiensis (Burkina Faso, Cameroon, Tanzania) and A. quadriannulatus (Zambia). Resistance alleles occur as both homozygotes or heterozygotes in all species except A. quadriannulatus, which is always heterozygous ( Figure 1B).
We also identified two mutations in codon 345 with very similar frequencies to those of each codon 296 mutation: T345M (C-to-T in the second codon position), co-occurring with A296G; and T345S (A-to-T in the first codon position), co-occurring with A296S. The high degree of linkage disequilibrium between genotypes in codons 296 and 345 confirmed that they were co-occurring in the same specimens (Figure 2; e.g., the 296G/345M allele pair had a Huff and Rogers r and

Rdl resistance mutations evolved on two unique haplotypes in A. gambiae and A. coluzzii
The high frequency of the 296S and 296G alleles in various populations of A. gambiae and A. coluzzii    Figure 3A), which were present in a subset of mostly A. gambiae populations (Guinea, Ghana, Burkina Faso and Cameroon; Figure 1A); with just a few A. coluzzii from Côte d'Ivoire in the N530K cluster. Both N530K and H539Q are in partial linkage disequilibrium with 296G alleles (Figure 2).

The 296S and 296G alleles are associated with hard selective sweeps
Next, we investigated the signals of positive selection linked to the 296S and 296G resistance haplotypes. First, we found that haplotypes carrying 296G and 296S alleles had longer regions of high extended haplotype homozygosity (EHH) than the wt ( Figure 4A), as expected under a scenario of selective sweeps linked to these resistant variants. A closer examination revealed that EHH decays slower at the 3′ region of Rdl ( Figure 4A): in both clusters, EHH is above 0.95 (i.e. 95% of identical haplotypes) in the region downstream of codon 296 (exons 7 and 8), but decays more rapidly towards the 5′ of the gene (EHH < 0.20 in exon 6a/6b, EHH < 0.10 in exon 1). The core resistance haplotypes had lengths of 5,344 bp for 296G and 4,161 bp for 296S (defined at EHH > 95%), which were one order of magnitude higher than wt haplotypes (460 bp), and covered all non-synonymous mutations linked to codon 296 alleles (T345M, T345S, N530K, and H539Q).
Next, to estimate the softness/hardness of the sweep, we calculated the profile of Garud's H statistics [33] and haplotypic diversity along the 2L chromosome arm ( Figure 4B [25,27,35], which is the target site of pyrethroids and DDT [24]. Positive selection in Vgsc was particularly strong in chromosomes that also carried 296S alleles (H 12 = 0.917 +/− 0.004 standard error), followed by  each of them specific to one allele (296S and 296G), that underwent independent hard selective sweeps ( Figure 4).

Co-segregation of Rdl haplotypes and 2La inversions
Rdl lies within the 2La chromosomal inversion, which is the longest in the A. gambiae genome (20.5-42.1 Mb) [19]. The 2La inversion emerged in the last common ancestor of the A. gambiae species complex [32] and is currently polymorphic in A. gambiae and A. coluzzii [36], where it is linked to a range of important phenotypes including adaptation to human environments [37], aridity [38], insecticide resistance [39], and susceptibility to Plasmodium falciparum [40]. Given that recombination is strongly reduced between chromosomes with discordant inversion karyotypes [21][22][23], any assessment of the evolution of genes within the 2La inversion, such as Rdl, needs to take into consideration whether haplotypes reside in inverted (2La) or non-inverted (2L+ a ) backgrounds.   (using genomes with known inversion karyotypes as a reference; Figure 5A and Supplementary Material SM1 and SM9). The first principal component clearly discriminated between each of the inversion genotypes (non-inverted 2L+ a /2L+ a homozygotes, inverted 2La/2La homozygotes, and 2La/2L+ a heterozygotes). We used this information to compare the frequencies of 2La karyotypes with Rdl codon 296 genotypes ( Figure 5B), and the karyotype frequencies per population ( Figure   5C). The pan-African 296G allele is present in all inversion karyotypes, but is more common in non-inverted backgrounds (73% of 296G/296G homozygotes have 2L+ a /2L+ a karyotypes; Figure   5B), in both A. gambiae and A. coluzzii populations ( Figure 5C). On the other hand, 296S alleles from A. arabiensis and Burkinabè A. coluzzii occur exclusively within the 2La inversion (100% of 296S/296S homozygotes are in 2La/2La karyotypes; Figure 5B).

Introgression of Rdl resistance haplotypes
In order to obtain a more complete picture of possible introgression events, we performed a phylogenetic analysis of haplotype alignments at four loci around Rdl: 5′ and 3′ regions of the gene, and two loci upstream and downstream of the gene body ( Figure 6). These phylogenies highlight two events of interspecific introgression (explored below in grater detail): 296G between A. gambiae and A. coluzzii (as reflected by their identical swept haplotypes; Figure 3), and 296S between A. coluzzii and A. arabiensis. In addition, they also confirm the spread of 296G haplotypes across different 2La inversion types (interkaryotypic introgression; Figure 5). In the following paragraphs, we characterise these introgressions and attempt to identify the donors and acceptors of each event.
Interspecific introgression of 296G and 296S haplotypes All four phylogenies exhibit two main clades separating A. gambiae and A. coluzzii haplotypes according to their 2La inversion karyotype, rather than by species (2La in blue, left; 2L+ a in red, right; ultrafast bootstrap support [UFBS] 91% and 97% respectively; Figure 6A). This clustering is due the fact that the 2La inversion has been segregating in A. gambiae and A. coluzzii since before the beginning of their speciation [32].
Conversely, if 296S had introgressed from A. coluzzii into A. arabiensis, we would see evidence of introgression between 296S A. arabiensis and wt A. coluzzii, but we do not (right panel in Figure   11    The fact that only Gabonese A. gambiae have significant support as the 296G donor population could indicate that they are closer to the founding 296G haplotype and/or the original introgression event. However, the negative results in other populations harbouring 296G alleles (Cameroon, Guinea; Supplementary Material SM13) could also be due to methodological limitations of our analysis -e.g., our conservative approach is restricted to specimens that are homozygous for both the inversion karyotype (2L+ a /2L+ a ) and codon 296 (296G/296G or wt/wt); and the similarity between wt A. gambiae and A. coluzzii relative to the highly divergent swept haplotype can hinder the identification of the original background.
The 296G haplotype spread from 2L+ a to 2La chromosomes The haplotype phylogeny from the Rdl 3′ region, where codon 296 variants reside, also revealed that the 2L+ a clade (non-inverted, red; Figure 6A) contained a sub-cluster of 296G haplotypes from both 2L+ a (orange) and 2La orientations (purple; Figure 6A; UFBS 98%). The deep branching of 296G-2La haplotypes within the 2L+ a clade implies that 296G originated in a non-inverted background and later spread to inverted chromosomes via interkaryotypic introgression.
Chromosomal inversions are strong barriers to recombination, but double cross-overs or gene conversion events can result in allelic exchange between non-concordant inversions [21,22] and thus explain this phylogenetic arrangement.
However, the phylogeny of Rdl 5′ haplotypes (which excludes codon 296 and the adjacent nonsynonymous mutations) showed that 296G-2La sequences (purple) branched within the wt-2La clade instead (blue; Figure 6B). Thus, interkaryotypic introgression only affects the swept haplotype at the 3′ end of Rdl (Figures 3 and 4), whereas the 5′ region is closer to the wt. We can confirm whether the introgression is specific to the 3′ swept haplotype by examining the profile of sequence divergence along the Rdl gene locus (Dxy; Figure 8). We expect 296G haplotypes to be more similar to wt-2L+ a than to wt-2La, given that the 296G allele first evolved in a 2L+ a background (blue line, Dxy ratio > 1; Figure 8). In the case of 296G alleles from 2La chromosomes, this expectation holds at the 3′ region of Rdl but not at 5′ nor outside of the gene, where allele frequencies are more similar to the wt-2La (purple line, Dxy ratio < 1; Figure 8).

Structural modelling predicts that 296G and 296S disrupt the dieldrin binding site in alternative ways
Finally, we investigated the effects of 296G and 296S resistance alleles on the structure of RDL receptors. The A. gambiae RDL receptor was modelled as a homopentamer based on the human GABA A receptor structure [43] (Figure 9). In wt receptors, the 296A residue is located near the cytoplasmic end of the pore-lining second transmembrane helix (M2) and its side chain is orientated into the pore ( Figure 9A), whereas codon 345 is located away from the pore, at the cytoplasmic end of the M3 helix with its side chain orientated towards the lipid bilayer. We carried out automated ligand docking for dieldrin in the wt receptor, finding a putative docking site along the receptor pore with estimated free energy of binding ( Gb) of −8.7 kcal/mol ( Figure 9B) complex with picrotoxin showed that this ligand forms multiple hydrogen bonds with residues lining the pore [43], but dieldrin lacks equivalent hydrogen bond-forming groups. Thus, the close contacts between 296A side chains and dieldrin suggest that van deer Waals interactions between these molecules are the predominant binding interaction.

Evolution of Rdl resistance: selective sweeps and multiple introgression events
Contemporary dieldrin-resistant A. gambiae and A. coluzzii appear to descend from two unique hard selective sweeps around the A296G and A296S mutations (Figures 3 and 4). Both sweeps occurred independently on different genomic backgrounds (Figure 6), and have undergone at least three introgression events (Figures 6-8 In the case of 296G, our data supports an origin in A. gambiae with 2L+ a chromosomes, followed by coluzzii swept haplotypes and A. gambiae wt specimens from Gabon (according to Patterson's D test; Figure 7B). A. gambiae resistance haplotypes have accrued more non-synonymous mutations than A. coluzzii (N530K and H539Q; Figure 1A), which is consistent with a longer evolutionary history in the former. In either case, the swept haplotype currently spans populations of both species across West and Central Africa -mimicking the pan-African selective sweep described for the homologous Rdl mutation in D. melanogaster [8,9,44]. This result is in line with previous studies that had hypothesized the existence of a pan-African 296G sweep due to the strong genetic differentiation found in this locus [10].
The interkaryotypic introgression of 296G haplotypes from non-inverted 2L+ a into 2La chromosomes (Figures 6 and 7) also facilitated the spread of 296G resistance alleles, e.g. in A.
gambiae populations with high frequencies of 2La/2La karyotypes such as Burkina Faso ( Figure   5C). While it is generally acknowledged that chromosomal inversions strongly suppress recombination [20], genetic exchange can occur via double cross-over recombination or gene conversion [21,22,45,46]. The reduction in recombination is weaker in regions distant from the inversion breakpoints [21], as it is the case for Rdl (located ~4.8 Mb and ~16.7 Mb away from the 2La breakpoints), which results in reduced differentiation at the centre of the inversion [36,38] (Supplementary Material SM15). Few events of adaptive introgression across inversion karyotypes have been described in Anopheles. One of such cases are certain loci involved in adaptation to desiccation, which are linked to 2La inversions but are exchanged in 2La/2L+ a heterozygotes [38,47]. Another example, possibly linked to gene conversion, could be the APL1 cluster of hyper-variable immune genes: its pattern of sequence variation is more strongly influenced by geography and species (A. gambiae/A. coluzzii) than by the 2La inversion [48].
On the other hand, the 296S selective sweep has a more restricted geographical distribution. In the Ag1000G cohort, 296S is only found in A. coluzzii from Burkina Faso (Figure 3). We also identify 296S alleles in A. arabiensis specimens from East (Tanzania), Central (Cameroon) and West Africa (Burkina Faso); as well as two A. quadriannulatus specimens from Zambia (which appears to be the first record in this species; Figure 1B).

Persistence of Rdl mutations after dieldrin withdrawal
Rdl is a highly conserved gene, with an extreme paucity of non-synonymous mutations over >100 Mya of evolutionary divergence [1] in culicines and anophelines, and low d N /d S ratios that indicate a prevalence of purifying selection (Supplementary Material SM4). In this context, the persistence of 296G and 296S alleles in natural populations for more than 70 years, in spite of its fitness costs in the absence of insecticide [14][15][16], has been a long-standing puzzle.
Our study provides two key insights to this question. First, we find that, relative to the wt, haplotypes with resistance alleles have an excess of non-synonymous genetic diversity (~18x increase in π N /π S in 296G, ~4x in 296S). This observation suggests that the emergence of 296G and,  (Figures 3 and 4). This near-universal association is highly relevant because codon 345 mutations are suspected to have compensatory effects that offset the costs of codon 296 variants [49,50]. Studies of fipronil resistance have shown that both the 296G allele and the combination of 296G and 345M alleles resulted in decreased insecticide sensitivity in A.
Interestingly, our structural modelling analyses predict opposite resistance mechanisms for each resistance allele: 296G results in a wider RDL pore with weaker van der Waals interactions with dieldrin ( Figure 9C, E); whereas 296S narrows the pore and impedes dieldrin docking due to steric hindrance ( Figure 9D (296G with 345M, 296S with 345S). Yet, the exact nature of the interaction between these codon 296 and 345 mutations remains unclear. Firstly, residue 345 does not have direct contacts with dieldrin or residue 296 ( Figure 9A). Secondly, indirect effects are uncertain too: in human receptors, mutations on the interface between the third and second transmembrane domains (where residues 345 and 296 reside, respectively) affect the transition to the desensitized functional state [52]; but residue 345 in A. gambiae is not buried in this interface and is instead facing the lipid bilayer ( Figure 9A), and the predicted effects of mutations T345M and T345S are not obvious.
Other possible factors behind the persistence of Rdl resistance alleles include the long half-life of dieldrin as an environmental organic pollutant; as well as the fact that Rdl is the target site of other insecticides such as fipronil, isoxazoline or meta-diamides [28][29][30]; a secondary target of avermectin [29], and, possibly, of neonicotinoids (imidacloprid), pyrethroids (deltamethrin), [49], and DDT [53].  Material SM7 and SM8). Yet, this overlap is still relevant for vector control: as pyrethroid resistance increases in Anopheles populations [54], the search for substitutes should take into account that some can be rendered ineffective by 296S or 296G (e.g. fipronil [28], avermectin [29], or, possibly, neonicotinoids such as imidacloprid [49]). This risk is currently highest in the West and Central African populations of A. gambiae and A. coluzzii where both 296G and Vgsc 995F [25] are common (Supplementary Material SM8). In the future, the introgression of 296S from East African A. arabiensis could further compound current complications caused by the already high frequencies of Vgsc 995S in this region [25].

Implications for vector control
This case study of the mechanisms that underlie persistence of dieldrin resistance is also relevant for integrated resistance management. Strategies such as insecticide rotations or mosaics rely on a gradual decline in resistance over time [31]. Instead, 296G and 296S haplotypes have accumulated additional non-synonymous mutations ( Figure 3A), some of which (codon 345) are putatively compensatory. As mentioned above, a similar altered selective regime has also been observed in Vgsc haplotypes with kdr mutations [25]. Interestingly, a study of Brazilian Ae. aegypti found that Vgsc kdr mutations did not decrease in frequency after a decade without public pyrethroid spraying campaigns [55]. Brazilian Ae. aegypti have a longer history of pyrethroid-based treatments than African Anopheles spp. [55,56]; thus, their resilient kdr mutations could be (i) recapitulating our observations with respect to Rdl and dieldrin, and (ii) prefiguring a similar persistence of Vgsc kdr in the A. gambiae complex after a future phasing-out of pyrethroids in response to their decreasing efficacy [54].
Overall, our results show that the Rdl resistance mutations that appeared after the pioneering deployment of dieldrin in the 1950s will still be relevant in the immediate future. Continued monitoring is thus necessary to understand the evolving landscape of genomic variation that underlines new and old mechanisms of insecticide resistance.

Data collection
We used variation data from individual A. coluzzii and A. gambiae mosquitoes from the Anopheles gambiae 1000 Genomes online archives, for Phase 2-AR1 [18]. Specifically, we retrieved the phased genotype calls, SNP effect predictions, and the array of accessible genomic positions. We

Genotype frequencies and linkage disequilibrium
We retrieved all non-synonymous genomic variants located within the coding region of Rdl (genomic coordinates: 2L:25363652-25434556) that were biallelic, phased, and segregating at >5% frequency in at least one population (henceforth, 'non-synonymous variants'). Parsing and filtering of genotype calls from Ag1000G was done using the scikit-allel 1.2.1 library [58] in Python 3.7.4.

Haplotype networks
We constructed a network of haplotype similarity using 626 biallelic, phased and non-singleton (shared between more than two samples) variants located in a region +/− 10kbp of Rdl codon 296 (middle nucleotide, coordinate 2L:25429236). We used the presence/absence of each allele within this genomic region to calculate Hamming distances and build minimum spanning networks [61], using the hapclust function from [25] (with distance breaks >3 variants). Network visualizations were produced using the graphviz 2.38.0 Python library [62], with haplotype clusters being colorcoded according to species, population and presence/absence of the resistance alleles in codon 296 (296S, 2L:25429235; 296G, 2L:25429236) and the 995th codon of Vgsc (Figure 3, Supplementary Material SM5, and SM6). The network visualization in Figure 3A excludes singletons and haplotype clusters with a cohort frequency <1%.
We calculated the sequence diversity (π) of each haplotype group in the same region (sequence_diversity function in scikit-allel), using a jack-knife procedure (iterative removal of individual haplotypes without replacement) [63] to estimate the average and standard error. We also calculated the sequence diversity in non-synonymous coding variants from this region (π N ), synonymous coding variants (π S ), and their ratio (π N /π S ).

Positive selection in haplotype clusters
We analysed the signals of positive selection in three haplotype groups, divided according to Second, we calculated the profile of Garud's H statistics [33] along the 2L chromosomal arm (moving_garud_h utility in scikit-allel; block length = 500 phased variants with 20% step). We performed the same calculations for the haplotypic diversity (moving_haplotype_diversity in scikitallel). We calculated the Garud H and haplotypic diversity estimates in the Rdl locus, using a jackknife procedure [63] (iterative removal of individual haplotypes without replacement) to calculate the mean and standard error of each statistic.

Karyotyping of 2La inversions
In order to assign karyotypes of the 2La inversion in all specimens from Ag1000G Phase 2, we used known 2La karyotypes from Phase 1 as a reference [2], and analysed genotype frequencies within the inversion by principal component analysis (PCA). Specifically, we retrieved the genotype frequencies of 1142 specimens from Ag1000G Phase 2, 765 of which were also present in Phase 1 and had been previously karyotyped for this inversion [2]; and selected 10,000 random SNPs (biallelic, shared between more than two samples, phased, segregating in at least one population, and located within the 2La inversion 2L:20524058-42165532). SNPs fitting these criteria were selected using the scikit-allel Python library, and the PCA was performed using the randomized_pca utility (with Patterson scaling).
Manual inspection of the principal components (Supplementary Material SM9) showed that PC1 (6.35% of variance explained) was sufficient to discriminate between known karyotypes from Phase 1 using a clear-cut threshold (2La/2La, 2La/2L+ a and 2L+ a /2L+ a ). We determined the optimal classification thresholds using the C-Support Vector classification method (SVC, a method for supervised learning) implemented in the scikit-learn 0.21.3 Python library [64]. Specifically, we used the SVC function in scikit-learn (svm submodule) to train a classifier with known karyotypes from Phase 1 (765 observations) and the main principal components of the PCA analysis (10 variables), using a linear kernel and C=1. The selected thresholds were able to classify Phase 1 data into each of the three categories (2La/2La, 2La/2L+ a and 2L+ a /2L+ a ) with 100% accuracy (as per the classifier score value), precision and recall (calculated using the classification_report function from the scikit-learn metrics submodule).

Phylogenetic analysis of haplotypes
We obtained genomic alignments of SNPs located from four regions around the Rdl locus, at the Each genomic alignment was then used to compute Maximum-Likelihood phylogenetic trees using IQ-TREE 1.6.10 [66]. The best-fitting nucleotide substitution model for each alignment was selected using the TEST option of IQ-TREE and according to the Bayesian Information Criterion (BIC), which suggested the GTR substitution matrix with ascertainment bias correction, four gamma (Γ) rate categories, and empirical state frequencies observed from the alignment (F) (i.e. the GTR+F+ASC+G4 model in IQ-TREE). We calculated branch statistical supports using the UF bootstrap procedure [67,68] and refined the tree for up to 10,000 iterations, until convergence was achieved (correlation coefficient ≥ 0.99).

Interspecific introgression with Patterson's D statistic
We analysed the signals of introgression along the 2L chromosomal arm using Patterson's D statistic [41,42]. This statistic requires allele frequencies in four populations (A, B, C and O) following a predefined (((A,B),C),O) phylogeny, where A, B and C are populations with possible introgression events, and O is an unadmixed outgroup. Then, D > 0 if there is an excess of allele frequency similarities between A and C (which means either A → C or C → A introgression) and D < 0 for excess of similarity between B and C (B → C or C → B introgression) [41,42]. We calculated Patterson's D along blocks of adjacent variants in the 2L chromosomal arm (block length = 10,000 variants, with 20% step length; phased variants only) using the moving_patterson_d utility in scikitallel. We also calculated D in the Rdl locus (2L:25363652-25434556), and estimated its deviation from the null expectation (no introgression: D = 0) with a block-jackknife procedure (block length = 100 variants; average_patterson_d in scikit-allel). We then used these jack-knifed estimates to calculate the standard error, Z-score and the corresponding p value from the two-sided Z-score distribution.
Using the procedure described above, we performed multiple analyses of introgression between

Alignment of Rdl orthologs
We retrieved Rdl orthologs from the following species of the Culicidae family (available in three isoforms were retained). These sequences were aligned using MAFFT 7.310 (1,000 rounds of iterative refinement, G-INS-i algorithm) [74]. Pairwise sequence identity between peptide sequences was calculated using the dist.alignment function (with a identity distance matrix, which was then converted to a pairwise identities) from the seqinr 3.4-5 library [75], in R 3.6.1 [76].
Pairwise d N /d S ratios were calculated from a codon-aware alignment of CDS sequences, using the dnds function from the ape 5.3 R library [77]. The codon-aware alignment of full-length CDS was obtained with PAL2NAL [78], using the peptide alignment as a reference. Tables of pairwise identity and d N /d S values have been created with pheatmap 1.0.12 [79].

Supplementary Material SM8. Co-segregation of Rdl and Vgsc mutations. A-B) Frequency of alleles in
Vgsc codon 995 and Rdl codon 296 per population, calculated per chromosome. Note: A. gambiae populations denoted with an asterisk (The Gambia, Guinea-Bissau and Kenya) are listed separately due to their high frequency of hybridisation and/or unclear species identification (see Methods). C) Geographical co-occurrence of Rdl and Vgsc mutations, at 10% and 30% frequency thresholds (chosen for illustrative purposes). Dots indicate presence. D) Euler diagrams and contingency table depicting the co-occurrence of Vgsc 995F and 995S alleles with Rdl 296G, 296S and wt alleles within chromosomes analysed in this study (n = 2356). For chromosomes carrying each of the Rdl haplotype groups, we include the percentage of associated genotypes at Vgsc codon 995. E) Number of chromosomes carrying 296S or 296G mutations (x axis) against number of 995F mutations (y axis), per population (only values >0 included). F) Contingency tables of Rdl and Vgsc resistance mutations co-occurrence, per population. Only populations were resistance alleles in are segregating in both genes are included. p values and odds ratios [OR] correspond to Fisher's exact tests (one-sided, testing for a greater co-occurrence of Rdl codon 296 and Vgsc 995 resistance alleles).

Supplementary Material SM12. 296S introgression between A. coluzzii and A. arabiensis. A) Profile of
Patterson's D in 2La/2La backgrounds, using A. coluzzii specimens as populations A and B (296S and wt, respectively); A. arabiensis as population C (296S as positive controls, wt as test), A. merus as a negative control for population C (wt); and either A. christyi or A. epiroticus as outgroups (wt). B) Profile of Patterson's D in 2La/2La backgrounds, using A. arabiensis specimens as populations A and B (296G and wt, respectively); A. coluzzii as population C (296S as positive controls, wt as test), A. merus as a negative control for population C (wt); and either A. christyi or A. epiroticus as outgroups (wt).
In all panels, the hypothesis under test can be summarised as follows: if 296S homozygotes from species i show evidence of introgression with wt homozygotes from species j but not with wt from i, it means that 296S originated in species j. Left plots depict the entire 2L chromosomal arm (orange lines demarcate 2La inversion), and rightmost plots focus on the Rdl locus (Rdl gene coordinates highlighted in red). D was calculated in sliding blocks of 10,000 phased variants (with 20% overlap). For each comparison, we report the mean value of D in the Rdl locus and use a block-jackknife procedure (block length = 100 variants) to estimate its standard error, a Z-score (standardized D) and p-value (that reflects deviation from the null expectation of D = 0). A. gambiae and A. coluzzii. A) Profile of Patterson's D in 2L+ a /2L+ a backgrounds, using A. coluzzii specimens as populations A and B (296G and wt, respectively); A. gambiae as population C (296G as positive controls, wt as test); and either A. quadriannulatus or A. melas as outgroups (wt). B) Profile of Patterson's D in 2L+ a backgrounds, using A. gambiae specimens as populations A and B (296G and wt, respectively); A. coluzzii as population C (296G as positive control, wt as test); and either A. quadriannulatus or A. melas as outgroups (wt). C) Profile of Patterson's D in 2La/2La backgrounds, using A. coluzzii specimens as populations A and B (296G and wt, respectively); A. gambiae as population C (296G as positive controls, wt as test); and A. merus as outgroup (wt). D) Profile of Patterson's D in 2La/2La backgrounds, using A. gambiae specimens as populations A and B (296G and wt, respectively); A. coluzzii as population C (296G as positive controls, wt as test); and A. merus as outgroup (wt).

Supplementary Material SM13. 296G introgression between
In all panels, the hypothesis under test can be summarised as follows: if 296G homozygotes from species i show evidence of introgression with wt homozygotes from species j but not with wt from i, it means that 296G originated in species j. Left plots depict the entire 2L chromosomal arm (orange lines demarcate 2La inversion), and rightmost plots focus on the Rdl locus (Rdl gene coordinates highlighted in red). D was calculated in sliding blocks of 10,000 phased variants (with 20% overlap). For each comparison, we report the mean value of D in the Rdl locus and use a block-jackknife procedure (block length = 100 variants) to estimate its standard error, a Z-score (standardized D) and p-value (that reflects deviation from the null expectation of D = 0).

Supplementary Material SM14. Diversity of 296G haplotypes in 2L+ a and 2La backgrounds. A)
Profile of EHH decay for each group of 296G haplotypes (296G in 2L+ a/ 2L+ a , 2La/2L+ a and 2La/2La backgrounds), built from 16,623 phased variants located +/− 150,000 bp from codon 296 (2L:25429236 position). B) Profile of haplotypic diversity along chromosomal arm 2L (sliding blocks of 500 variants with 20% overlap). C) Absolute sequence divergence (Dxy) between 296G alleles of 2L+ a background and wt resistance haplotypes of 2L+ a and 2La backgrounds. D) Absolute sequence divergence (Dxy) between 296G alleles of 2La background and wt resistance haplotypes of 2L+ a and 2La backgrounds. All values are calculated in windows of 20,000 kbp with 10% overlap.
Supplementary Material SM15. Genetic differentiation in the 2La inversion. Differentiation (Hudson's F ST ) along the 2L chromosomal arm between A. gambiae and A. coluzzii species, separated by their 2La karyotype (2La/2La or 2L+ a /2L+ a ). Panel A shows comparisons with A. gambiae with 2L+ a /2L+ a karyotypes, and panel B for A. gambiae with 2La/2La karyotypes. F ST estimates have been calculated in adjacent blocks of 5,000 phased variants with 20% overlap. Sub-panels at the right focus on the Rdl genomic locus. Note that interkaryotype comparisons have higher F ST in the 2La region than inter-species comparisons.