Lineage-specific positive selection on ACE2 contributes to the genetic susceptibility of COVID-19

Abstract The Angiotensin-Converting Enzyme-2 (ACE2) gene, located on Xp22.2, attracts a great deal of attention because the protein it encodes is believed to be the functional cellular receptor for the new coronavirus (SARS-CoV-2). However, recent studies are controversial, especially concerning the intrinsic link between ACE2 diversity and COVID-19 susceptibility. Here, we conduct a population genetic study on ACE2 in 6354 individuals representing 210 present-day populations and 5329 individuals of ancient or archaic groups. We dissected the genetic architecture of ACE2 and identified two major haplogroups (hg) in East Asians, i.e. ACE2-hg1 (43%) and ACE2-hg2 (53%), while other populations harbor more diverse ACE2-hgs. Accordingly, there was a significant loss of ACE2 common variations in East Asians in contrast to the X-chromosome-wide and genome-wide patterns. Notably, association analysis between ACE2-hgs and COVID-19 severity in 1229 Han Chinese individuals with various levels of COVID-19 severity showed a higher risk of ACE2-hg1 (odds ratio = 1.56, P < 0.01) and a lower risk of ACE2-hg2 (odds ratio = 0.65, P < 0.01). Interestingly, ACE2-hg1 is in strong linkage disequilibrium with rs1849863-C, which is an assumed risk factor of elevated plasma ACE2 level and is related to a higher risk of COVID-19 severity, hospitalization and infection. Strikingly, remarkable signatures of positive selection were detected, especially on ACE2-hg2, and were traced back to 100 000 years ago (but rose to a strong level during the Bronze Age, 5000∼3000 years ago, in East Asians). The selection pressures could have stemmed from multiple sources, but pre-COVID-19 viral epidemics and pandemics might have been potential driving forces, which consequently contributed to the genetic susceptibility to COVID-19 within and between populations.


INTRODUCTION
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in humans has resulted in the outbreak of coronavirus disease 2019 (COVID- 19) and has already caused millions of deaths and hundreds of millions of confirmed cases around the world. Angiotensin-converting enzyme 2, the protein encoded by the ACE2 gene (Xp22.2), was identified as the functional receptor for SARS-CoV-2 as well as the earlier SARS-CoV [1][2][3][4].
ACE2 is widely expressed in various organs, which provides possible routes of cell entry for SARS-CoV-2. Higher ACE2 expression was ob-served in organs other than the lung, such as the intestine, heart and kidney [5]. Many studies have been focusing on the genetic expression profiles of ACE2 to predict COVID-19 susceptibility across worldwide populations, but these results remain controversial because a different relative abundance of ACE2 was inferred across populations [6][7][8].
Genetic variations of ACE2 might account for COVID-19 susceptibility but the diversity of ACE2 has yet to be fully understood. Some recent studies exclusively focused on searching for risk variants in ACE2. Nonetheless, the frequencies of loss-offunction (LoF)/missense variants are extremely low at ACE2 [9,10]. In particular, none of the population-based genome-wide association studies showed a significant signature of ACE2-linked variants within or between populations [11][12][13][14][15]. A recent study also suggested a population-specific response to SARS-CoV-2 as East Asian populations have higher frequencies of the expression quantitative trait loci (eQTL) variants associated with higher ACE2 expression [9]. Another study of the ACE2 gene suggested that the susceptibility of South Asians is more similar to that of East Asians rather than West Eurasians, according to the haplotype similarity [16].
There are good reasons to believe that the genetic susceptibility to COVID-19 linked to ACE2 was driven by some historical evolutionary events rather than being recently established due to a new coronavirus. ACE2 is also a primary modulator of the renin-angiotensin system (RAS) and plays a crucial role in regulating blood flow, pressure and fluid homeostasis. It has a beneficial role in many diseases such as hypertension, diabetes and cardiovascular disease, where its expression is decreased [17]. ACE2 was shown to be associated with cardiovascular diseases, especially hypertension in rat models of high blood pressure [18]. Despite the potential importance of the ACE2 gene in the cardiovascular system, controversial results were reported from association studies [19][20][21][22][23], likely due to differences in various factors such as subtypes of diseases, study design, population ethnicities and sample size.
Substantial disparities among different ethnic groups in COVID-19 outcomes have been reported by recent studies [24]. Therefore, in this study, we attempted to dissect the genetic architectures of the ACE2 gene in worldwide populations, and systematically investigated the genetic diversity, haplotype structure and natural selection of ACE2. The purpose of this study is to identify the genetic determinants at ACE2 associated with genetic susceptibility to COVID-19 and to elucidate the evolutionary mechanism of ACE2 diversity within and between populations. In particular, we propose that some an-cient natural selection on ACE2 likely contributed to lower genetic susceptibility to COVID-19 in East Asia (EAS).

Dissecting the genetic structures of ACE2
The majority (∼88%) of ACE2 sequences in worldwide populations can be clustered into four haplogroups, i.e. ACE2-hg1 (36%), ACE2-hg2 (31%), ACE2-hg3 (6%) and ACE2-hg4 (15%), as revealed by hierarchical clustering based on the identicalby-state (IBS) distance matrix (Methods, Fig. 1A) and confirmed with principal component analysis (PCA) (Fig. S1). Haplogroup informative markers (HIMs) were then determined based on the highly differentiated alleles across the four major haplogroups (Methods). An HIM was defined as a variant with one allele fixed in at least one of the haplogroups and the other allele fixed in the remaining haplogroups. A total of 39 HIMs of high quality and strong linkage were identified for further haplogroup classification (Fig. 1B, Supplementary Data , Table  S1). Notably, ACE2-hg1 and ACE2-hg2 shared 87% of alleles at these HIMs.
We further investigated the prevalence of ACE2hgs across continental populations ( Fig. 2A, Fig.  S2). Notably, only two major haplogroups (43% ACE2-hg1 and 53% ACE2-hg2) were observed in EAS, with a third minor kind in low frequency (3% ACE2-hg3). In contrast, there were three major haplogroups (40% ACE2-hg1, 21% ACE2-hg2 and 31% ACE2-hg4) in West Eurasia (EUR), and a similar pattern in America (AMR). In South Asia (SAS), all of the four haplogroups are relatively common (24% ACE2-hg1, 47% ACE2-hg2, 8% ACE2-hg3 and 19% ACE2-hg4). As expected, haplogroup diversity is much larger in Africa (AFR); apart from the four common haplogroups (35% ACE2-hg1, 9% ACE2-hg2, 13% ACE2-hg3 and 8% ACE2-hg4), a few other kinds in relatively low frequencies (C) Divergence tree of ACE2 sequences across worldwide populations. Sequences from the same populations and of the same haplogroups are grouped and named after their population names and haplogroups, i.e. pop-hap. We excluded minor haplogroups in modern human populations from this analysis. Sequences of archaic hominids were used as outgroups. The tree topology was supported by 100 times bootstrapping replicates by random sampling of 20 sequences for each replicate from each pop-hap group. Within-pop-hap TMRCA and cross-pop-hap divergence time were averaged across all the replicates. The maximum TMRCA and divergence time in each clade are displayed at the corresponding node, and the ranges of dates are provided in brackets. Statistical significance: * P < 0.05, * * P < 0.01, * * * P < 0.001.

Ancestral origins and evolution of ACE2 haplogroups
We constructed the haplotype network (Methods) for the ACE2 sequences (Fig. 2B). ACE2-hg4 is found in the highest frequency in EUR and in a lineage different from that of the other three haplogroups since the early split. It also seems that ACE2-hg4 has a relatively higher genetic affinity with the archaic ACE2 sequences but is not inherited from archaic introgression based on the analysis with ArchaicSeeker2.0 [25]. ACE2-hg1 is much more basal than ACE2-hg2 and ACE2-hg3. Both ACE2-hg2 and ACE2-hg3 are likely to be derived from ACE2-hg1. Some other minor haplogroups might also be derived from ACE2-hg1, which was mainly found in AFR populations.
We further estimated the time to the most recent common ancestor (TMRCA) and divergence time for the ACE2 sequences (Methods). A phylogenic tree of ACE2 sequences was constructed using hierarchical clustering based on the pairwise divergence time (Fig. 2C, Fig. S4). The common ancestor of the four haplogroups was traced back to as early as 1400 KYA. Consistent with the haplotype network, ACE2-hg4 was located in an isolated lineage. It diverged with the other lineages comprising ACE2-hg1, ACE2-hg2 and ACE2-hg3, followed by the split from archaic ACE2 sequences at ∼600 KYA. Sublineages of ACE2-hg4 were formed in ∼300 KYA and found in both AFR and non-AFR populations. In contrast, ACE2-hg3 diverged earlier from the others at ∼750 KYA, but ACE2-hg1 and ACE2-hg2 diverged more recently at ∼250 KYA. The TMRCA estimated for each of the three haplogroups in non-African populations was as recent as <75 KYA, consistent with the Out-of-Africa (OOA) dispersal.
Analyses of ACE2 sequences in ancient samples indicated that ACE2-hg1/2 and ACE2-hg4 were already prevalent 50 000 years ago (KYA) in human populations (Fig. 3A), although ACE2-hg1 and ACE2-hg2 could not be distinguished in ancient DNA data [26] (Methods). Moreover, available sequences within the same haplogroups were identical in ancient samples dated from 50 KYA to those in present-day populations (Fig. 3B, Methods). Available archaic ACE2 sequences are also almost identical with respect to the analyzed variants in modern human data.
Notably, excess rare variations were observed for all haplogroups in all populations, as also indicated by significantly negative Tajima's D statistics [27] (Fig. 2C, Methods). These results, especially the excess of rare variants and relatively small TM-RCA, implied a recent selective sweep on the ACE2 gene and the quick expansion of ACE2-hg1 and ACE2-hg2.

Association between ACE2 haplogroups and COVID-19 severity
We further employed the ACE2 variations of Han Chinese COVID-19 patients [14,28,29] and examined the connections of ACE2 haplogroups with COVID-19 severity. A total of 1229 samples (601 males and 628 females) were analyzed, with COVID-19 severity ranked as asymptomatic, mild, moderate, severe and critical. Detailed information is provided in the Supplementary Data (Fig. S5, Table S3). Notably, females tended to be mild, while males were likely to have higher severity (odds ratio (OR) = 1.40, P < 1.44 × 10 -3 ). When individuals were grouped more broadly, i.e. asymptomatic, mild, moderate severities, and others, the pattern was retained (OR = 1.43, P < 2.07 × 10 -3 ). These results are in agreement with previous findings [30], i.e. that there must be some protective mechanism in females. Only male samples were used in the following analyses to investigate the variations of severity across individuals and to eliminate the potential biases caused by the phasing error, unbalanced chromosome copies across genders, and tangles of different haplogroups in the same female individual (see Supplementary Data).
The association between COVID-19 severity and ACE2 haplogroup type is expected to be related to the expression level of ACE2 and the biological roles of ACE2-hg1 and ACE2-hg2. A recent study has reported the casual effect of elevated plasma ACE2 levels on COVID-19 severity, hospitalization and infection [32]. The odds ratio is ∼1.6, which is similar to the estimation in our study. The lead variant (rs1849863) for one of the most significant protein quantitative trait loci (pQTL) was ∼100 kilobases (kb) upstream to ACE2. The effect allele (C) of rs1849863 was able to upregulate the expression of ACE2 (β = 0.164), suggesting rs1849863-C as a risk factor. Analysis in our study revealed a considerably high linkage between the rs1849863-C variation and ACE2-hg1 (β ≈ 0.7-0.8, P < 10 -15 ) (Fig. S7), while the connections between rs1849863-C and other haplogroups were much lower (β ≈ −0.2). Previous studies have reported the differentiated functions of alleles on ACE2-hg1 and ACE2-hg2 in the context of other diseases. For example, the T allele of rs2285666, carried by ACE2-hg2, is significantly associated with a lower risk of cardiovascular death in females of European ancestry [33]. In Han Chinese female subjects, alleles of both rs2074192 (T) and rs2106809 (A) that were carried by ACE2-hg1 were associated with reduced circulating angiotensin-(1-7) levels [34], which may partially account for the greater blood pressure response to changes in dietary sodium intake in the Chinese population [35] and also be responsible for the increased susceptibility to hypertension [36]. These findings suggest the differential risks of ACE2-hg1 and ACE2-hg2 to multiple human diseases.

Significant loss of common variations at ACE2 in EAS
Analysis of the genetic diversity of ACE2 in worldwide populations (Methods) indicated the lowest genetic diversity in EAS populations, estimated by  the pairwise difference (θ π ), although there was a similar number of segregating sites (θ K ) across populations in Eurasia (Fig. S8). These results indicated a loss of common or ancient variants at the ACE2 gene in EAS populations. We further confirmed that the lower diversity of ACE2 in EAS cannot be explained by X-chromosome-wide diversity with the confounding factors under control (Fig.  S8, Methods), because an even lower genetic diversity of ACE2 was observed compared with the chromosome-wide background, suggesting the loss of common variants in the ACE2 gene driven by some local evolutionary force such as natural selection. Moreover, the relatively low cross-population variance of ACE2 diversity also suggested that natural selection as a driving force of ACE2 variations was prevalent in EAS.
To examine to what extent the selective sweep affected the flanking genes of ACE2, we scanned the region spanning ∼200 kb up and downstream of ACE2, encompassing ACE2 and another six genes including FIGF, PIR, BMX, TMEM27, CA5BP1 and CA5B. There was an obvious decrease in diversity at ACE2 and BMX compared with other flanking genes, especially in EAS (Fig. 5A and B). BMX is in strong linkage with ACE2 (Fig. S9) and encodes a non-receptor tyrosine kinase and participates in cardiac growth and endothelial signalling. Therefore, the two genes are highly relevant both physically and functionally, with natural selection affecting the diversity of both genes significantly. TMEM27 is ∼30 kb upstream of ACE2 and encodes a novel homolog of ACE2. However, no significant differentiation of allele or haplotype frequencies was observed among worldwide populations, and all the continental groups are of high nucleotide diversity at TMEM27. In addition, we collected 15 genes related to COVID-19 (COVID19Genes) ( Table  S5) and 64 genes related to immunity (Immune-Genes) on the X chromosome (Methods). Compared with the X-chromosome-wide protein-coding genes, COVID19Genes and ImmuneGenes, ACE2 also showed extremely low nucleotide diversity in EAS (Fig. S10).

Positive selection on ACE2-hg2 in EAS
A compound test of natural selection based on DHH statistics combining Tajima's D [27], Fay and Wu's H [37,38] and H12 [39] statistics was further applied to an analysis of ACE2 sequence data (Methods). Previous studies have demonstrated the robustness and power of the joint test of both sites-and haplotype-frequency-based methods [38,40,41]. EAS-specific signals of positive selection were revealed by the DHH statistics (P < 0.006) (Fig. 5C, Fig. S11), consistent with the observation of the reduced diversity in ACE2 in East Asian populations ( Fig. 5A and B, Fig. S8).
The DHH signals were also supported by the three specific statistics (Fig. 5D, Figs S12 and S13). An excess of rare variants compared with common variants was indicated by the Tajima's D statistics (theoretical P < 0.001, Tajima's D < 0). The effect of demography was controlled based on the chromosome-wide data (empirical P < 0.009, Tajima's D < 0), with other confounding factors eliminated following the previous study [42] (Methods). The results of Fay and Wu's H statistics revealed an excess of variants with high derived allele frequency (empirical P < 0.03, Fay and Wu's H < 0). Moreover, haplotype homozygosity analysis with H12 statistics suggested the dominance of the top two haplotypes in EAS (empirical P < 0.009).
Haplotypes of ACE2 in the selected region (chrX:15586447-15601720) were clustered in a super branch in EAS (Fig. 5E). A great number of rare variants (i.e. singletons and doubletons) were observed on ACE2-hg1 and ACE2-hg2, implying their recent expansion. There were 44 (77.2%) branches observed on the node of the clustering center in EAS (Fig. 5E), while the numbers were much lower for other populations (≤30 branches (50%)) (Fig.  S14). The EAS-specific excess of rare variants was also revealed by both Fu and Li's D statistics (empirical P < 0.005, Fu and Li's D < 0) and Fu and Li's F statistics (empirical P < 0.006, Fu and Li's F < 0) [43] (Figs S12 and S13), which were designed by comparing the intermediate/low frequency variants with singletons.

Adaptive evolution of ACE2 in EAS over the last 3000 years
We applied Relate analysis [44] to investigate historical evidence of selection on ACE2, following the estimate of population size with chromosome X data in the 1000 Genomes Project (Methods). Population sizes inferred from the X-chromosomal data (Fig. S17) and from autosomal data [44] were comparable. The results indicated the onset of selection in modern human populations over the last 100 000 years in both AFR and EAS populations. The footprints of natural selection could be found in ACE2-hg1, ACE2-hg2 and ACE2-hg3 (P < 0.01) (Fig. S18).
However, in EAS, strong signatures were observed in HIMs shared by both ACE2-hg1 and ACE2-hg2, i.e. rs4240157 (P < 0.005), rs4646174 (P < 0.005) and rs879922 (P < 0.001). The three HIMs were also located in the region with significant DHH signals (Fig. 5F) and were also confirmed to be the most likely causal variants in EAS with iSAFE (for 'integrated selection of allele favored by evolution') analysis [45] (Fig. S19). The swiftest allele frequency increases occurred at ∼3-5 KYA in EAS (Fig. 5F, Fig. S18), roughly corresponding to the evolution of civilization in the Bronze Age. The selection coefficient reached 0.04 over the last 3000 years in EAS, assuming an additive model (Fig. 5F, Methods), or even greater under a dominant model (Fig. S18). The allele frequency changes inferred with Relate were also supported by the data of ancient samples across the Eurasian continent [26] (Fig. S20, Methods).

DISCUSSION
In this study, we focused on the genetic diversity of the ACE2 gene and analyzed one aspect of population genetics and molecular evolution. We were aware that, recently, many other genes have been reported to be potentially associated with COVID-19 susceptibility or severity. The heritability of severe COVID-19 susceptibility was estimated to be 6.5% based on genome-wide variations [11], suggesting a significant genetic contribution to COVID-19 susceptibility and severity. One reason why we focused on this particular gene, ACE2, was that we observed a connection between ACE2 variants and COVID-19 severity in our data. ACE2 is the key factor for the cell entry of SARS-CoV-2. Many studies have been trying to identify LoF/missense variants at ACE2 with high allele frequency differences across worldwide populations [9,10], which might be an indicator of population-specific susceptibility. There were two strong assumptions behind the idea and design of these studies. One was that COVID-19 susceptibility/severity must be associated with some novel mutations, and the other was that human susceptibility to COVID-19 was just recently established. However, all the ACE2 LoF/missense variants are of extremely low frequency among worldwide populations, preventing them from being potential genetic factors of COVID-19 susceptibility. Our analysis suggested that some regulatory variants at the intron region of ACE2 are more likely to contribute to host susceptibility to COVID-19, and these variants are standing variants in high frequency rather than novel mutations in low frequency.
By dissecting the haplotype structure of ACE2 across worldwide populations, we successfully connected the ACE2 haplogroups with COVID-19 severity (Fig. 4B), which was also supported by the variant-based association analyses (Fig. 4C). These results highlight the genetic contributions of ACE2 to COVID-19 severity. It is also a striking finding that ACE2-hg1 may function differently to ACE2-hg2, although they were recently diverged (∼250 KYA) and are in high sequence similarity (87% HIMs). In EAS populations, our data indicated high COVID-19 vulnerability of ACE2-hg1 carriers and low vulnerability of ACE2-hg2 carriers. The carriers of the two haplogroups, i.e. ACE2-hg1 and ACE2-hg2, are expected to experience different demography as ACE2-hg2 was under positive selec-tion in EAS. Summary data from the COVID19-hg database also supported the same trend of the vulnerability associated with ACE2-hg1 and ACE2-hg2 in populations of EUR ancestry.
When the entangled multiple factors and models are simplified, the varying prevalence of ACE2 haplogroups across global populations can be an indicator of population-specific susceptibility. EAS populations harbor both high-frequency ACE2-hg1 (43%) and ACE2-hg2 (53%), indicating differentiation of COVID-19 severities, i.e. the majority will have a lower risk of severe COVID-19, but another half will have a higher risk. In EUR populations, ACE2-hg1 is in similar frequency (40%) to EAS, but ACE2-hg2 is much lower (21%), which might mean a larger proportion of severe or critical patients. In SAS, ACE2-hg2 (47%) is more abundant than ACE2-hg1 (24%), meaning fewer severe or critical cases, relatively speaking. Despite this oversimplified model, the trends were supported by recent work that suggested SARS-CoV-2 may have a higher impact on the EAS-ancestry population than on the SAS-ancestry population [24]. The situation in African populations is more complex as the connection between diverse ACE2-hgs and COVID-19 severity has not been established yet. Despite the genetic contribution to COVID-19 severity being identified at ACE2, it would be noteworthy if other factors also played important roles, such as the influence of government policy and medical practice.
One of the most important findings of this study is that the genetic basis of susceptibility to COVID-19 via ACE2 might have been established early in human evolutionary history. A recent study proposed that the common ancestor of primates was strongly resistant to SARS-CoV-2, and that New World monkeys were completely resistant, whereas apes and Old World monkeys, like most humans, are susceptible [46]. Several recent studies proposed that ACE2 was under selection constraint in the human lineage [46,47]. Our data analysis also showed the extremely low allele frequencies of LoF/missense variants at ACE2 among worldwide populations, and the relatively recent TMRCA of all major ACE2 haplogroups. It is not unexpected, as ACE2 is a primary modulator of the RAS and plays a crucial role in regulating blood flow, pressure and fluid homeostasis. The functional conservation and importance of ACE2 mean natural selection must have been common in driving its diversity. However, population differentiation of ACE2 diversity also suggested ancestry-specific demography and local adaptation of ACE2 variants or haplotypes.
Our analysis revealed that the frequency increases of ACE2-hg2 experienced an acceleration in the Bronze Age. It is not clear what kind of selection pressures have driven this process; human activities like civilization evolution could play an important role, and some ancient viral epidemics and pandemics also likely contributed to the patterns. We would point out that ancient purifying selection may couple with the later positive selection on ACE2, resulting in selection cancelation to some degree. Present-day data and currently available methods have limitations with regard to reconstructing the detailed process, and only captured the eventual consequence. Accordingly, we can detect the positive selection of ACE2-hg2 in EAS, which is the most likely scenario, but we do not reject the possibility of other selective sweeps-they might just be too weak to be detected. For example, balancing selection on both ACE2-hg1 and ACE2-hg2 could not be well supported by our analyses. Meanwhile, the genetic variation of ACE2 in EAS was not likely to result from a negative selection, which cannot explain all the signals pointing to a positive selection. Suppose the negative selection has some considerable effects on ACE2 in EAS populations; we would expect ACE2-hg3 to be the most likely target of negative selection due to its minor frequency in EAS (<5%). We would also expect the frequency of ACE2-hg3 to be different between males and females because males are haploid and are more likely to be influenced by deleterious mutations. It turned out that the data do not support the expectation, i.e. the frequency of ACE2-hg3 is very similar between males and females (∼2% in males and ∼3% in females). Similar frequencies were also observed between males and females for other haplogroups (Fig. S2). Other continental groups may have different scenarios. We have found that the pattern went in the opposite direction in EUR populations, for example; the estimation of θ K was relatively smaller than that of θ π (Fig. S12). A previous study also reported significant positive Tajima's D statistics in populations of European ancestry [48], which might suggest a balancing selection in EUR.
The gene region of ACE2 is accessed from Ensembl (chrX:15579156-15620271) [58]. We collected 15 genes on the X chromosome that were reported to be related to COVID-19 susceptibility (COVID19Genes) (Table S5). Also, 64 genes related to immunity (ImmuneGenes) were determined based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [59].

Ethical statement
The ACE2 sequence data of Chinese COVID-19 patients were accessed from previous studies [14,28,29] with the approval of the Ministry of Science and Technology of China (Project IDs: 2020BAT0262, 2020BAT0611 and 2021BAT3145). All procedures performed in this study involving human participants were approved by the Biomedical Research Ethics Committee of Shanghai Institute of Nutrition and Health (no. ER-SINH-262005).

Statistical and population genetic analysis
Chromosome-wide genetic diversity was estimated within sliding windows of 10 kb in length, advanced by 5 kb by estimators of nucleotide diversity (θ π ) and numbers of segregating sites (θ K ) using publicly available software [60]. Tajima's D statistics [27], Fay and Wu's H statistics [37,38], H12 statistics [39,61,62], Fu and Li's D statistics [43], and Fu and Li's F statistics [43] were calculated in the same way. The empirical P-value was estimated by ranking the statistical values of ACE2 among all the sliding windows on protein-coding genes across the X chromosome. The confounding factors were controlled as in the previous study [42]. The DHH test was designed by combining Tajima's D, Fay and Wu's H, and H12 statistics as in a previous study [41], and normalized Fay and Wu's H was employed [38].
We performed hierarchical clustering and PCA of ACE2 sequences based on the distance matrix of genetic differences. We defined the HIMs as those with one allele fixed in at least one of the haplogroups and the other in the remaining haplogroups. We did further haplogroup classification for ACE2 sequences in all data sets based on the identified HIMs. Haplotype networks were constructed using the median-joining method implemented in Network (version 10) [63], followed by the maximum parsimony method. Within-group TMRCA and cross-group divergence time were estimated on the gene region covered by HIMs (34 kb in length). We performed 100 replicates in total, with 20 sequences randomly sampled from each group for each replicate.
We applied Relate (v1.1.6) [44] to the KGP data set for selection detection. The main procedure of Relate was performed on the combined data set of the X chromosome including all of the 3775 sequences, followed by the 'EstimatePopulationSize' and 'DetectSelection' analysis for each population. The selection coefficient was estimated based on the allele frequency change inferred by Relate. We assumed that the allele frequency change between every two time points could only be influenced by a constant selection force.
Detailed descriptions of the methods are available in the Supplementary Data.

DATA AVAILABILITY
Software for calculating genetic diversity as well as other sliding-window-based statistics is available at https://github.com/Shuhua-Group/Theta D H. Est/. Software for calculating TMRCA is available at https://github.com/Shuhua-Group/TMRCA/.