The origin and composition of Korean ethnicity analyzed by ancient and present-day genome sequences.

Abstract Koreans are thought to be an ethnic group of admixed northern and southern subgroups. However, the exact genetic origins of these two subgroups remain unclear. In addition, the past admixture is presumed to have taken place on the Korean peninsula, but no genome-scale analysis has explored the origin, composition, admixture, or past migration of Koreans. Here, 88 Korean genomes compared with 91 other present-day populations showed two major genetic components, from East Siberia and Southeast Asia. Additional paleogenomic analysis with 115 ancient genomes from Pleistocene hunter-gatherers to Iron Age farmers showed a gradual admixture of Tianyuan (40 ka) and Devil's Gate (8 ka) ancestries throughout East Asia and East Siberia up until the Neolithic era. Afterward, the current genetic foundation of Koreans may have been established through a rapid admixture with ancient Southern Chinese populations associated with Iron Age Cambodians. We speculate that this admixing trend initially occurred mostly outside the Korean peninsula, followed by continuous spread and localization in Korea, corresponding to the general admixture trend of East Asia. Over 70% of extant Korean genetic diversity can be explained by such a recent population expansion and admixture from the South.


Introduction
The 1000 Genomes Project (1KGP) showed that East Asians (EA) displayed a common genetic bottleneck with non-African humans around the last glacial maximum (1000 Genomes Project Consortium et al. 2015). However, the 1KGP includes only five EA populations and therefore fails to fully represent EA genome structures. In 2009, the HUGO Pan-Asian SNP Consortium (PASNP) confirmed a general concordance between linguistic and genetic affiliations (HUGO Pan-Asian SNP Consortium et al. 2009). Most recently, the Asian diversity project showed a correlation between geographical coordinates and genetic structure in Asia (Liu et al. 2017). Although Koreans are genetically similar to the Chinese, the PASNP, 1KGP, and Asian diversity projects cannot fully explain the detailed makeup and peopling of the Korean Peninsula.
Koreans belong to the Altaic language group and are known to be homogeneous in Northeast Asia along with the Chinese and the Japanese. There are ~85 million Koreans in total (51 million South Koreans, 25 million North Koreans, and 7 million outside of the Korean Peninsula) unified by shared ethnic and linguistic traits. There are currently several hypotheses on the origins of the Koreans. The Korean Y-chromosome haplogroup (O2b-SRY465) suggests that the ancestors of the proto-Koreans are related to the people who inhabited northeastern China during the Neolithic (9,900-10,000 years BP) and Bronze (3,450-2,350 years BP) Ages (Kim et al. 2011). On the other hand, mitochondrial DNA (mtDNA) shows that Koreans display a very typical East Asian profile (Jin et al. 2009). Previous population studies have revealed that Koreans have not undergone any severe genetic bottlenecks and primarily consist of two genetic components (Takeuchi et al. 2017). One is strongly associated with China, but the other is less clear. To date, the exact genetic makeup of Koreans has not been investigated at a whole-genome scale using both present-day and ancient genomes.
Paleogenomics is a powerful tool to reveal the exact genetic lineages and affinities that cannot be resolved with present-day populations alone, because frequent and complex genetic exchanges occur with or without cultural and linguistic exchanges. Archeological data unearthed in Korea provide the proto-Korean chronology and prehistory of the Korean Peninsula. The oldest archaic relics found in South Korea, such as the Acheulean axes, date back hundreds of thousands of years; however, human bone preservation is poor due to the acidic soils, so no ancient genetic data can be acquired from them (Norton 2000). The earliest hominid evidence in the Peninsula dates to between 400,000 and 600,000 years ago (YA) (Park 1992). In spite of the claims about human bones in North Korea (Norton 2000; Bae and Bae 2012), such paleoanthropological materials are rare in Korea. Therefore, Korean ethnic origins can currently be inferred only through ancient genomes found in nearby regions, such as Devil's Gate in the Russian Far East (8,000 years BP) (Siska et al. 2017) and Tianyuan cave, Beijing (40,000 years old) (Yang et al. 2017). Fortunately, Neolithic to Iron Age ancient genomes from Southeast Asia (SEA) have become available recently (Lipson et al. 2018). Such ancient genomes, taken from a wide geographic and temporal distribution, should allow us to answer when and how the genomes of Southeast Asia contributed to the genetic makeup of Koreans.

Data Set
A total of 88 Korean samples available from the KoVariome database were used (Kim et al. 2018).

Whole-Genome Sequencing and Genotyping
Samples were subjected to WGS and genotyping (supplementary table S2, Supplementary Material online). Genomic DNA was extracted using a QIAamp DNA Blood Mini Kit (Qiagen, CA) and 69 WGS libraries were constructed using TruSeq DNA sample preparation kits (Illumina, CA). Sequencing was performed using Illumina HiSeq sequencers following the manufacturer's instructions. Low-quality reads were removed by NGSQC-toolkit (ver. 2.3.3) with the "-l 70 and -s 20" options (Patel and Jain 2012). Filtered reads were aligned to the human reference genome (hg19) using BWA-MEM (ver. 0.7.8) (Li and Durbin 2009). We further removed PCR duplicates using MarkDuplicates in Picard (ver. 1.9.2, http://broadinstitute.github.io/picard/, last accessed April 17, 2020) and conducted IndelRealigner and BaseRecalibration using GATK (ver. 2.3.9) (McKenna et al. 2010). We predicted individual single-nucleotide variants using GATK UnifiedGenotyper (McKenna et al. 2010) with the "-heterozygosity 0.0010 -dcov 200 -stand_call_conf 30.0 -stand_emit_conf 30.0" options. To check for artifacts that can arise when variants from various resources are merged (caused by different sequencing platforms, alignment algorithms, and genotype callers), WGS-based variants were merged with the genotypes of six Koreans generated from the human SNP panel data (Lazaridis et al. 2014). Finally, we pruned the panel using linkage disequilibrium information with plink and the "--indep-pairwise 200 25 0.4" option (Purcell et al. 2007).
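As an illustration of the final pruning step, the following is a minimal Python sketch of window-based pairwise LD pruning. The three parameters mirror the window size (200 SNPs), step (25 SNPs), and r² threshold (0.4) quoted above. The function names and the toy genotypes are hypothetical, and the sketch is simplified: it always drops the later SNP of a correlated pair, which is an assumption of this example and not a description of plink's internal rule.

```python
from math import sqrt  # not strictly needed, kept for readers extending the sketch


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked
    return cov * cov / (var_x * var_y)


def ld_prune(snps, window=200, step=25, threshold=0.4):
    """Return indices of SNPs kept after greedy pairwise pruning.

    snps: list of per-SNP dosage vectors, ordered along the chromosome.
    Simplification (assumption): the later SNP of a correlated pair is dropped.
    """
    keep = [True] * len(snps)
    for start in range(0, len(snps), step):
        in_window = [i for i in range(start, min(start + window, len(snps))) if keep[i]]
        for pos, i in enumerate(in_window):
            if not keep[i]:
                continue
            for j in in_window[pos + 1:]:
                if keep[j] and r_squared(snps[i], snps[j]) > threshold:
                    keep[j] = False
    return [i for i, kept in enumerate(keep) if kept]


# Toy data: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with SNP 0.
toy_snps = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2], [0, 0, 1, 2, 2, 1]]
kept = ld_prune(toy_snps)
```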

Genomic Clustering
We used CHROMOPAINTER to infer "chromosome chunks" for each individual for fineSTRUCTURE (Lawson et al. 2012) analysis and clustered 88 Koreans (supplementary table S1, Supplementary Material online) and 208 present-day individuals (supplementary table S2, Supplementary Material online). In total, we reclustered 185 present-day genomes and 6 Korean genomes using CHROMOPAINTER and fineSTRUCTURE (Lawson et al. 2012). Using these individuals, we ran ADMIXTURE (ver. 1.23) (Alexander et al. 2009) with K = 2-14 (supplementary figure 3, Supplementary Material online). We generated a dendrogram from each ADMIXTURE result (K = 2-14) using the hcluster function in R. We evaluated the consistency of the ADMIXTURE and fineSTRUCTURE results by calculating their correlation using the "cor.dendlist" function with the "cophenetic" method in the "dendextend" package in R (supplementary figure 4, Supplementary Material online). The correlation was highest at K = 10 (corr. = 0.78). We therefore used the ADMIXTURE result at K = 10, which best represents the genetic clusters identified by fineSTRUCTURE. We performed a principal component analysis (PCA) with EIGENSOFT (ver. 6.0.1) smartpca (Patterson et al. 2006).
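The dendrogram comparison can be illustrated with a small pure-Python sketch of a cophenetic correlation: for each tree, every pair of leaves is assigned the height at which the two leaves merge, and the Pearson correlation between the two resulting vectors is reported. This is an illustrative reimplementation under stated assumptions, not the dendextend code; the nested-tuple tree encoding `(height, left, right)`, the function names, and the toy trees with generic leaf labels are all hypothetical.

```python
from math import sqrt


def leaves(node):
    """Leaf labels under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def cophenetic(node, out=None):
    """Map each unordered leaf pair to the height of the node where they merge."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    height, left, right = node
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = height
    cophenetic(left, out)
    cophenetic(right, out)
    return out


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cophenetic_correlation(tree_a, tree_b):
    """Pearson correlation of cophenetic distances; both trees must share leaves."""
    dist_a, dist_b = cophenetic(tree_a), cophenetic(tree_b)
    pairs = list(dist_a)
    return pearson([dist_a[p] for p in pairs], [dist_b[p] for p in pairs])


# Toy trees over four generic leaves (not data from this study).
tree_1 = (3.0, (1.0, "A", "B"), (2.0, "C", "D"))
tree_2 = (3.0, (1.0, "A", "C"), (2.0, "B", "D"))
same = cophenetic_correlation(tree_1, tree_1)
different = cophenetic_correlation(tree_1, tree_2)
```

In this toy case identical trees give a correlation of 1, and swapping leaves between clades lowers it, which is the property used to pick the K whose ADMIXTURE dendrogram agrees best with the fineSTRUCTURE tree.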

Admixture Time Estimation
We used the ALDER program (Loh et al. 2013) to estimate the admixture time of Koreans, using the Koreans themselves as one reference population. We used filtering criteria of a genotyping rate > 99%, MAF > 0.01, and Hardy-Weinberg equilibrium P value > 0.000001.
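The three filtering criteria can be sketched for a single SNP as follows. This is a minimal illustration in Python: the function name and genotype encoding (0/1/2 dosage, `None` for a missing call) are hypothetical, and the Hardy-Weinberg P value is computed with a 1-degree-of-freedom chi-square approximation, which is an assumption of this sketch and may differ from the test actually used.

```python
from math import erfc, sqrt


def snp_passes_qc(genotypes, min_call_rate=0.99, min_maf=0.01, min_hwe_p=1e-6):
    """Return (passed, call_rate, maf, hwe_p) for one SNP.

    genotypes: list of 0, 1, 2 (copies of one allele) or None for missing calls.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return False, 0.0, 0.0, 0.0
    call_rate = n / len(genotypes)
    observed = [called.count(0), called.count(1), called.count(2)]
    p = (2 * observed[2] + observed[1]) / (2 * n)  # frequency of the counted allele
    maf = min(p, 1 - p)
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = erfc(sqrt(chi2 / 2))  # chi-square survival function with 1 d.f.
    passed = call_rate > min_call_rate and maf > min_maf and hwe_p > min_hwe_p
    return passed, call_rate, maf, hwe_p


# Toy SNPs: one in exact Hardy-Weinberg proportions, one with only heterozygotes.
balanced = snp_passes_qc([0] * 25 + [1] * 50 + [2] * 25)
all_het = snp_passes_qc([1] * 100)
```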

The Genetic Affinity between the Ancient and Present-Day Populations
To investigate the genetic relationship between populations of interest, we used the D and outgroup f3 statistic framework implemented in ADMIXTOOLS (Patterson et al. 2012). The genetic affinity between the ancient and present-day populations was measured with the outgroup f3 statistic using the following notation: f3(X, Y; Yoruba), where X and Y are ancient and present-day populations, respectively. To better represent the genetic association of the present-day populations with a focal ancient genome, we applied a scaled f3 statistic, f3_scaled = (f3 - m)/(M - m), where m and M represent the minimum and maximum f3 statistic, respectively. To cluster the ancient genomes in this study, we analyzed a pairwise outgroup f3 statistic of the form f3(X, Y; Yoruba). In this analysis, both X and Y were ancient genomes.
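The outgroup f3 statistic and its rescaling to the range 0 to 1 can be sketched in Python as below. The f3 estimator here is the plain average of (o - x)(o - y) over SNPs from population allele frequencies, without the finite-sample bias correction that dedicated software applies; that simplification, the function names, and the toy frequencies are assumptions of this sketch.

```python
def outgroup_f3(freq_x, freq_y, freq_outgroup):
    """Simplified f3(X, Y; Outgroup): mean of (o - x) * (o - y) across SNPs."""
    products = [(o - x) * (o - y) for x, y, o in zip(freq_x, freq_y, freq_outgroup)]
    return sum(products) / len(products)


def scale_min_max(values):
    """Rescale values so the minimum maps to 0 and the maximum maps to 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; scaling is undefined")
    return [(v - low) / (high - low) for v in values]


# Toy allele frequencies at two SNPs (not data from this study).
f3_value = outgroup_f3([0.1, 0.5], [0.1, 0.5], [0.5, 0.1])

# Toy f3 values of three present-day populations against one ancient genome.
scaled = scale_min_max([0.20, 0.25, 0.30])
```

A larger outgroup f3 value indicates more shared drift between X and Y since their split from the outgroup; the scaling only changes the display range used for comparison and does not change the ranking.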

Admixture Model Construction
To construct an admixture model depicting the historical genetic makeup of Koreans and other Asians, we fitted the SNP panel to the admixture models with the qpgraph program (Patterson et al. 2012) based on results from the D-statistics and f3 statistics in our study. We first set the skeleton for the admixture model as Tianyuan, Onge, and Ami by adapting a previous study (McColl et al. 2018) (worst-fitting Z = 0.044). Then, we added Kinh, which has a high admixture f3 score with Devil's Gate to Koreans (worst-fitting Z = -3.887), and then Devil's Gate, Ulchi, Koryak, Mixe, and MA1 (worst-fitting Z = 3.317). Finally, Koreans, Han, and Japanese were added to model the suggested admixture of East Siberians (E si) and East Asians b (EA b) (worst-fitting Z value of -3.686). We manually calibrated the final model with a time point estimated from the ALDER results.

Korean Genetic Structure
To infer the genetic association between the 88 Koreans (supplementary table S1, Supplementary Material online) and our selected neighboring populations, we collected WGS data from 185 contemporary individuals belonging to 91 populations (fig. 1A and supplementary table S2, Supplementary Material online). We included people from 21 Southeast Asian and 31 North Asian ethnic groups, from which Koreans could have originated. We predicted an average of 1.5 million homozygous and 2.6 million heterozygous single-nucleotide variants per individual (supplementary table S2, Supplementary Material online). We merged the WGS-based SNPs with the Human Origins SNP panel data set and finally produced 199,629 autosomal SNPs for genetic comparison. To infer the genetic structure of the Korean ethnic group, we clustered 94 Koreans, including 6 published Koreans genotyped with a SNP chip, by applying the CHROMOPAINTER and fineSTRUCTURE (Lawson et al. 2012) programs. These algorithms clustered 279 individuals into 64 homogeneous groups according to the haplotype patterns.
Heat map of scaled outgroup f3 statistics of the form f3(X, Y; Yoruba), where X and Y are ancient and present-day populations, respectively. We scaled f3 statistics between 0 and 1. In the heat map, black indicates that the f3 scaled value is close to 0 and red indicates the value is close to 1. For ancient genome X (on rows), the scaled f3 statistic for a given cell is calculated by f3_scaled = (f3 - m)/(M - m), where m and M represent the minimum and maximum f3 statistic. Therefore, the smallest f3 in each column has f3_scaled = 0 (black) and the largest has f3_scaled = 1 (red). We ordered ancient genomes on the x axis according to the time scale. We also separated Central Steppe (CS) ancestry (black arrow) (de Barros Damgaard et al. 2018) and Chinese and Southeast Asian ancestry genomes (blue arrow) (Lipson et al. 2018). P on the bottom bar, Pleistocene hunter-gatherers; N, B, and I, Neolithic hunter-gatherer, Bronze, and Iron Age, respectively.
Overall, data for these statistics are found in supplementary figure
dominated in the E si and EA a/b populations, respectively, although these ratios were slightly different depending on the number of ancestral groups (K). The dendrogram correlation analysis showed the greatest consensus between the fineSTRUCTURE clades and ADMIXTURE results at K = 10 (supplementary figure 4, Supplementary Material online). At K = 10, we observed 38% and 62% of the E si and EA a/b genetic components in the Koreans, respectively (fig. 1C). Comparing admixture rates among the EA b populations, both the Korean and Japanese populations showed very similar levels of genetic admixture rates, consistent with their sister groups in the fineSTRUCTURE tree (fig. 1C). Takeuchi et al. (2017) reported a high degree of genetic similarity between the Korean and mainland Japanese, and the estimated admixture date of the EA-wide genetic component to Japan was in the Yayoi period (3,000-1,700 years BP). The Chinese also have similar genetic compositions to the Korean and Japanese; however, their admixture rates differed depending on geographic region. Overall, we conclude that genetic admixture events occurred first between the Southeast Asians and Chinese outside Korea and Japan and then spread, rather than occurring separately in Korea or Japan locally. It is also possible that such a recent genetic admixture was a broad phenomenon, happening concurrently all across EA, driven by a population expansion caused by the agricultural, economic, and technological advances of the last 4,000 years (Lipson et al. 2018).
(Haak et al. 2015). In addition, we included the Tianyuan genome from northern China (Yang et al. 2017), two ancient genomes unearthed from the Devil's Gate cave near North Korea (Siska et al. 2017), and eight ancient genomes from Southeast Asia dating from the Neolithic to the Iron Age (Lipson et al. 2018), making a total of 115 genomes.
We measured levels of pairwise genetic affinity among the ancient and present-day genomes by using outgroup f3 statistics of the form f3(ancient, present-day; Yoruba) (Patterson et al. 2012). This analysis gives the global landscape of the genetic associations between ancient and present-day genomes (supplementary figure 5 and table S4, Supplementary Material online). The scaled f3 statistics showed that the ancient Tianyuan individual (40,000 years BP from China) shares more alleles with present-day Siberian (E si and W si) and East Asian (EA b) populations than with other present-day populations such as Europeans and West and South Asians (supplementary figure 5, Supplementary Material online). This suggests that Tianyuan is the basal genetic component of the East Eurasian and East Asian lineage. We also observed that present-day E si and EA b populations had significant genetic affinities with ancient Southeast Asians. We examined Tianyuan's genetic affinities for E si and EA a/b using D-statistics of the form D(Yoruba, Tianyuan; E si, EA a/b) (supplementary figure 8, Supplementary Material online). In these statistics, the Tianyuan genome showed a higher level of genetic affinity with present-day E si than with Southeast Asians. However, several EA b (Korean, Japanese, and south Chinese) populations showed levels of affinity with Tianyuan-derived alleles similar to those of the E si populations and were equally distant from the Tianyuan lineage. This suggests that Devil's Gate ancients, present-day E si, and several EA b populations were subject to similar genetic influences over time and are expected to form a single clade, since they all originally separated from the Tianyuan lineage. These lines of analysis reveal that the basal ancient of the Tianyuan genome was separated in the Neolithic or pre-Neolithic era and independently affected current Koreans.
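A D-statistic of the form D(W, X; Y, Z) can be sketched from allele frequencies as a ratio of sums over SNPs, following the frequency-based definition in Patterson et al. (2012). The sketch below is a point estimate only: it omits the block jackknife used to obtain standard errors and Z scores, and the function name and toy data are hypothetical.

```python
def d_statistic(w, x, y, z):
    """Point estimate of D(W, X; Y, Z) from per-SNP allele frequencies.

    Numerator:   sum of (w - x) * (y - z)
    Denominator: sum of (w + x - 2wx) * (y + z - 2yz)
    With this sign convention, D < 0 when X shares more alleles with Y than with Z.
    """
    numerator = sum((a - b) * (c - d) for a, b, c, d in zip(w, x, y, z))
    denominator = sum(
        (a + b - 2 * a * b) * (c + d - 2 * c * d) for a, b, c, d in zip(w, x, y, z)
    )
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator


# Toy pseudo-haploid calls (0 = ancestral, 1 = derived); not data from this study.
outgroup = [0, 0, 0, 0]
ancient = [1, 1, 1, 1]
pop_y = [1, 1, 1, 0]
pop_z = [0, 1, 0, 0]
d_skewed = d_statistic(outgroup, ancient, pop_y, pop_z)

# When Y and Z carry identical frequencies, the numerator vanishes.
d_symmetric = d_statistic([0.2], [0.8], [0.5], [0.5])
```

A value near zero is the pattern expected when Y and Z are equally related to X, which is how the "equally distant" observations in the text would appear in such a statistic.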

The Ancient Gene Flow Making Up the Korean Ethnic Group
We focused on the gene flow from the Neolithic ancients into the Korean and EA populations. Based on Tianyuan's gene flow into Neolithic ancients and present-day populations, we hypothesized that either the Neolithic ancient genome contributed to the genetic ancestry of Korean or EA populations independently, or a second gene flow could have occurred (fig. 2B). We also analyzed genetic associations of ancCS with other ancients and present-day populations using D-statistics of the form D(Yoruba, ancCS; ancient, present-day populations) (supplementary figure 9, Supplementary Material online). This indicated that present-day E si and EA populations and ancSEA are equally related to ancCS, sharing similar levels of ancCS-derived alleles. This is in agreement with the genetic admixture patterns of Asian ancestry in CS ancients (Allentoft et al. 2015; Damgaard et al. 2018). It supports genetic admixture between ancCS and present-day EA populations; however, it cannot explain how, or through how many events, the ancCS influence on EA occurred. We also observed the first evidence of the genetic divergence of Vat Komnou and several EA b (Southeast Asian and Southern China) populations from Man Bac (fig. 3B and supplementary table S7, Supplementary Material online). This supports the idea that these ancients are new genetic resources that genetically influenced EA (fig. 2A). We observed several possible ancient founders by D-statistics; however, these could not clearly resolve the current genetic makeup of Koreans. To resolve it, we additionally analyzed the admixture pattern of the ancient/present-day Southeast Asians and Devil's Gate ancients to Koreans with admixture f3 statistics (table 1). Notably, the combinations of the Devil's Gate genome and ancSEAs better represent the current Koreans than those of Devil's Gate and modern Southeast Asians.
Specifically, we observed the lowest admixture f3 statistics when source 1 was Vat Komnou (Iron Age in Cambodia), followed by Nui Nap (Bronze Age in Vietnam). In a previous study, Nui Nap was a new genetic component close to present-day Vietnamese and Dai but not the ancestors of Austroasiatic speakers (Lipson et al. 2018). Meanwhile, the ancSEAs with the next lowest admixture f3 statistics were Ban Chiang and Man Bac, who are also ancients of Austroasiatic speakers. In order to investigate whether the ancSEA genetic components migrated into Korea, we analyzed the Koreans' genetic affinity with present-day populations by outgroup f3 statistics of the form f3(Korean, present-day populations; Yoruba) (fig. 3C and supplementary table S8, Supplementary Material online). The group with the highest genetic affinity with the Koreans was the Japanese. The southern Chinese (Han and She) had a higher genetic affinity with Koreans than the present-day Lau or Vietnamese, which is consistent with the admixture results (fig. 1C). This suggests that the genetic components of South Chinese were transferred into Korea after admixing with Vat Komnou and Nui Nap ancestries (fig. 3C). These lines of evidence support the conclusion that populations who carried Devil's Gate and Man Bac genomes admixed throughout the EA b and E si regions until the Neolithic period, probably accompanied by climate changes and barriers. After the Bronze Age, the admixed genetic ancestry of the Vat Komnou and Nui Nap migrated to Korea due to rapid cultural and technological advances.
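The reason a low (negative) admixture f3 statistic points to admixture can be seen in a short worked derivation. It is an idealized sketch: it assumes the target allele frequency c at a SNP is an exact mixture of the two source frequencies a and b with proportion α (a symbol introduced here for illustration), and it ignores post-admixture drift and sampling noise.

```latex
\begin{align}
c &= \alpha a + (1-\alpha)\, b, \qquad 0 \le \alpha \le 1 \\
c - a &= (1-\alpha)(b - a), \qquad c - b = \alpha\,(a - b) \\
(c - a)(c - b) &= -\,\alpha(1-\alpha)\,(a - b)^2 \;\le\; 0
\end{align}
```

Averaged over SNPs, this product is the admixture f3 statistic for the target with the two sources. Drift in the target after admixture adds a positive term, so the more negative the observed value, the better the two sources approximate the true admixing populations under this model.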

Korean Haplotype Analysis Reveals Multiwaves of Genetic Components
We analyzed haplotype distributions using WGS data of 88 unrelated Koreans from the KoVariome database (Kim et al. 2018) (supplementary table S1, Supplementary Material online). Nonrecombining Y-chromosome analysis showed a significant proportion of the "O" haplogroup in 55 male Koreans: 29% "O2b" and 42% "O3" (fig. 4A). The next most frequent Y-chromosome haplogroup was "C" (18%). The Y-chromosome haplogroup distribution agreed with the well-established Y-chromosome haplogroup "O" expansion and colonization within the Korean Peninsula (Kim et al. 2011). A comparison with the global Y-chromosome haplogroup distribution suggested that haplogroup "C" is widespread in Siberia, whereas "O" haplogroups show a spatial distribution in Southeast Asia (Chiaroni et al. 2009; Karmin et al. 2015). This strongly suggests a dual origin for Korean males. In contrast to the Y-chromosome distribution, mtDNA haplotypes reflect a more complex genetic history (fig. 4B). The most frequent mtDNA haplogroup was "D" (34%), and ten additional mtDNA haplogroups ("M," "B," "N," "G," "F," "R," "A," "C," "Y," and "Z") were identified with frequencies ranging from 23% to 2%. We constructed an mtDNA tree combining 11 ancients and 99 present-day EA a/b and Siberian (E si and W si) mtDNAs (fig. 4C). The 11 ancients included in this tree had relatively high sequencing depth (supplementary table S9, Supplementary Material online). Similar to the global human mtDNA phylogeny, our mtDNA tree shows two major clades, M 0 and R 0, dominantly distributed in EA populations (Soares et al. 2009). It also shows two mtDNA dispersions ~40 and 20 ka, which account for 62% and 38% of the present-day Koreans, respectively. The earlier dispersed mtDNAs included "N/Y/A," "D," and "B/R," which were distributed to 16%, 34%, and 12% of Koreans, respectively.
The mtDNA haplotypes of the "N/Y/A" and "D" clades coclustered with present-day Siberians as well as the Devil's Gate ancients, representing Eurasian ancestry. The "A" haplogroup was also frequently observed in the early and middle Bronze Age Okunevo peoples (Lipson et al. 2018), who were culturally associated with baKarasuk (Lipson et al. 2018). We also identified ancient mtDNA "R 0" diverging into "B/R," accounting for 12% of Koreans, that also expanded ~40 ka. The root of this clade was Tianyuan, and it also coclustered with Vat Komnou ancients and present-day Chinese, representing EA ancestry. This could explain the genetic influence of the Tianyuan on Korean genomes via ancSEA. These old mtDNA waves reflect human migration in the late Pleistocene, when the Yellow Sea was land and the west coast of Korea was therefore connected to mainland China. The later dispersed mtDNA haplogroups consisted of "G/C/Z," "M," and "F," which account for 19%, 12%, and 7% of Koreans, respectively. The "G/C/Z" clades coclustered with Siberians and Bronze Age Nui Nap in Vietnam. However, the genetic origin of the Nui Nap is still unknown. On the other hand, the mtDNA haplogroup "C" is frequently observed in the early and middle Bronze Age Okunevo peoples who lived in central steppe regions (Lipson et al. 2018). The mtDNA topology and haplotype frequency in Okunevo imply a genetic association between Nui Nap and central steppe ancients. Both the "M" and "F" clades showed subsequent diversification from the ancient mtDNA haplogroups ancM (M 0) ~20 ka and ancR (R 0), which diverged at 60 ka, respectively. These clades explain southern waves of human migration by coclustering with EA b populations. In particular, two ancients of Austroasiatic speakers, Man Bac and Ban Chiang, coclustered in the mtDNA "M" lineage (fig. 3C).
This suggests that a subsequent expansion of this clade can be associated with the expansion of the Austroasiatic-speaking population (Lipson et al. 2018). Haplotype analysis and the phylogenetic tree of the mtDNA support a continuous genetic influence from the north and south into Korea.
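The split of Korean mtDNA lineages into an earlier and a later dispersal can be checked with a short tally. The percentages are the ones quoted in the text above; the variable and function names are hypothetical.

```python
# Percentage of present-day Koreans per mtDNA clade group, as quoted in the text.
EARLIER_WAVE = {"N/Y/A": 16, "D": 34, "B/R": 12}  # dispersal around 40 ka
LATER_WAVE = {"G/C/Z": 19, "M": 12, "F": 7}       # dispersal around 20 ka


def wave_totals(earlier, later):
    """Return (earlier total, later total, grand total) in percent."""
    early_total = sum(earlier.values())
    late_total = sum(later.values())
    return early_total, late_total, early_total + late_total


totals = wave_totals(EARLIER_WAVE, LATER_WAVE)
```

The two totals reproduce the 62% and 38% shares reported for the two dispersals, and together they cover all sampled Koreans.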

Admixture Time Estimation for Koreans
We estimated the admixture time of Koreans using 286,222 SNPs and obtained significant results with only three populations as references: Yakut, Han, and Japanese (table 2). The estimated admixture times were 5,482, 3,583, and 2,827 YA when we used the Koreans themselves as one reference and Yakut, Han, and Japanese as the other reference population, respectively. Our estimated admixture time with Japanese (97 generations away from the Japanese) is slightly earlier than the admixture date of the mainland Japanese (52 generations) estimated by Takeuchi et al. (2017). We summarized our model of the genetic influence on Koreans, from the pre-Neolithic Tianyuan to the Iron Age Vat Komnou, in figure 5. This model supported the above gene flows well, suggesting that Koreans contain prehistoric genetic components derived from the Devil's Gate and Man Bac groups, both of which diverged from Tianyuan ancestry. The Neolithic Man Bac genome dominantly inherited the genetic components of Tianyuan, and its genetic components were widely distributed in EA. However, the Bronze and Iron Age ancients, such as Oakaie, Nui Nap, and Vat Komnou, seem to have substantially altered the genetic components of EA b genomes (70%). This is consistent with the EA b ancestry frequency in contemporary Koreans. This model generally describes well the gene flow among the three Northeast Asian populations: Korean, Chinese, and Japanese.
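The relation between the dates in years and in generations can be made explicit with a short Python sketch. The only inputs taken from the text are the Korean/Japanese estimate (2,827 YA, 97 generations) and the other two dates in years; the generation time is not stated in this section, so the value below is back-calculated from that one pair and is an assumption of this sketch, as are the function names.

```python
def implied_generation_time(years, generations):
    """Years per generation implied by one (years, generations) pair."""
    return years / generations


def years_to_generations(years, generation_time):
    """Convert an admixture date in years to generations."""
    return years / generation_time


# Korean/Japanese estimate quoted in the text: 2,827 years and 97 generations.
generation_time = implied_generation_time(2827, 97)

# Other admixture dates quoted in the text (years ago), by reference population.
dates_in_years = {"Yakut": 5482, "Han": 3583, "Japanese": 2827}
dates_in_generations = {
    reference: years_to_generations(years, generation_time)
    for reference, years in dates_in_years.items()
}
```

Under this back-calculated generation time of roughly 29 years, the Yakut and Han estimates correspond to approximately 188 and 123 generations; these are derived illustrations, not values reported in table 2.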

Conclusion
We analyzed the haplotype distributions of 88 Koreans compared with ancient and modern whole genomes and suggested two major haplotype expansion events. A comprehensive genome comparison confirmed that Koreans possess dual ancestral genetic components originating broadly from East Siberia (E si) and East Asia (EA b). Ancient genome comparisons revealed that the genetic makeup of Koreans can be best described as an admixture of the Neolithic Devil's Gate genome in Russia and the Iron Age Vat Komnou in Southeast Asia. Our analyses of ancient and present-day populations suggest a long and gradual admixture model of two Neolithic founders, the Devil's Gate founder in Russia and the founder from Tianyuan Cave in China. These two major components were admixing throughout East Siberia and East Asia for an extended time up until the Neolithic period. Subpopulations of current East Asians, as well as modern Koreans, were probably established by a later regional genetic transition during the Bronze Age. The peopling of Korea is most likely a part of large population expansion and the subsequent admixture events which occurred in East Asia, rather than a unique isolated event or migration. We think that this kind of recent rapid expansion and admixture could be general models for other East Asian and Southeast Asian populations in which Bronze and Iron Age populations expanded and admixed with other peripheral region populations.
FIG. 5-Admixture tree model depicting the historical genetic makeup of Korean. A qpgraph (Patterson et al. 2012) fitted on an admixture model depicting the historical genetic makeup of Koreans and other Asians. We fitted the admixture tree model with ancient genomes associated with EA b populations to make a model that could best explain the gene flow that makes up Koreans and hence the admixture model information for E si ancestry has been simplified. Based on the D-and f3 statistics and previous reports (Lipson et al. 2018), we set the skeletal tree (supplementary figure 10A, Supplementary Material online) and extended the model by adding ancient and present-day individuals (supplementary figure 10, Supplementary Material online). The average admixture time of Koreans is noted next to the red circle which was estimated by ALDER (table 2). Black circles represent ghost genomes in ancestral genetic lineages lacking any evidence for a time calibration and new groups may be added when more ancient populations are found and sequenced. Black lines represent the gene flow and dotted lines represent admixture events with the marked proportions estimated by qpgraph analysis. The admixture time is shown in generations before the present. The number in the parentheses indicates 95% confidence interval of the generation and years.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.