Whole-genome resequencing analyses of five pig breeds, including Korean wild and native, and three European origin breeds

Pigs have been one of the most important sources of meat for humans, and their productivity has been substantially improved by recent strong selection. Here, we present whole-genome resequencing analyses of 55 pigs of five breeds representing Korean native pigs, wild boar and three European origin breeds. 1,673.1 Gb of sequence reads were mapped to the Swine reference assembly, covering ∼99.2% of the reference genome, at an average of ∼11.7-fold coverage. We detected 20,123,573 single-nucleotide polymorphisms (SNPs), of which 25.5% were novel. We extracted 35,458 of non-synonymous SNPs in 9,904 genes, which may contribute to traits of interest. The whole SNP sets were further used to access the population structures of the breeds, using multiple methodologies, including phylogenetic, similarity matrix, and population structure analysis. They showed clear population clusters with respect to each breed. Furthermore, we scanned the whole genomes to identify signatures of selection throughout the genome. The result revealed several promising loci that might underlie economically important traits in pigs, such as the CLDN1 and TWIST1 genes. These discoveries provide useful genomic information for further study of the discrete genetic mechanisms associated with economically important traits in pigs.


Introduction
Domestication of pigs originated from local wild boar populations in Europe and Asia circa 10,000 yrs ago 1 and occurred independently from wild boar subspecies in Europe and Asia. 2 There are hundreds of domestic pig breeds available across the world, and pigs have served as one of the most important sources of animal protein for humans. Since the last century, traditional selective breeding has contributed significantly to the genetic improvement of economically important traits in pigs. For example, substantial improvements were achieved in pigs per sow per year (annually +0.25) and the feed conversion ratio (annually −0.009 to −0.070 kg of the dry matter intake per live mass gain kg) during recent decades. 3,4 Furthermore, some pig breeds have been used as mammalian model animals for the biomedical research because of their physiological and anatomical similarities to humans, and for developing several swine lines, including highly inbred miniature pig breeds. 5,6 However, most of the achievements were made in European origin breeds, whereas, there are fewer systematic genetic improvement programs for diverse Asian local breeds.
There are two types of Korean indigenous pigs currently registered with the worldwide Domestic Animal Diversity Information System of the Food and Agriculture Organization of the United Nations: Chookjin Chamdon and Jeju native pig. 7 However, they are generally termed as the Korean native pig because of their same origin until the middle of the last century. Korean native pigs are known for their peculiar characteristics including higher intramuscular fat contents and chewy texture, which are appealing palatability factors to Korean pork consumers. However, they also have poorer productivities, such as lower growth rates and litters per sow per year than typical imported breeds, such as Duroc and Yorkshire. 8,9 Since the beginning of the last century, the native breeds have been endangered, especially because of the extensive crossbreeding with economically promising imported breeds. To conserve the native genetic resource, recent efforts have been made by the National Institute of Animal Science (NIAS) in Korea to restore and manage the Korean native population. 10 The Korean native pig is genetically more closely related to other Asian native pigs than to most European origin pigs. 11 Kim et al. 12 further showed that it might have a distinct genetic distance even from some Chinese native pig breeds.
The completion of the swine sequencing project has resulted in many single-nucleotide polymorphisms (SNPs) being identified throughout the genome. 1 The recent advances in next-generation sequencing (NGS) technologies have led to further cataloguing of SNPs by resequencing of diverse pig breeds or multiple animals in comparison with the reference sequence assembly. Furthermore, the SNPs derived from NGS have been used successfully to dissect genomic characteristics of pigs. For example, Rubin et al. revealed significant signatures of selection in pig genomes by scanning whole genomes using SNPs. 13 In addition, three Berkshire pigs were resequenced recently to explore their genetic relationship with 38 pig genomes obtained from the public database, providing useful information on the breed's origin and domestication. 14 In this study, we present whole-genome sequencing analyses of 55 pigs of five breeds, including Korean wild boar, Korean native, Duroc, Landrace, and Yorkshire. Substantial numbers of SNPs were identified across the genome using the Swine reference assembly (Sus scrofa 10.2). SNP sets derived from the whole-genome sequencing were used to explore genomic characteristics among the diverse pig breeds and to detect genomic regions potentially implicated in the strong selective breeding applied to the pig populations in this study.

Sampling and DNA extraction
We sequenced 55 animals from five pig breeds, including 10 Korean wild boars, 10 Korean native pigs (4 boars and 6 sows), 6 Duroc pigs (3 boars and 3 sows), 14 Landrace pigs (7 boars and 7 sows), and 15 Yorkshire pigs (7 boars and 8 sows) for this study ( Table 1). The Korean native pigs were accessed at the Swine Science Division, NIAS, Rural Development Administration (RDA), Suwon; and the Subtropical Animal Experiment Station in NIAS, RDA, Jeju. The Duroc, Landrace, and Yorkshire pig samples were collected at the NIAS and Sunjin Co., Ltd in Korea. The Korean wild boars were captured in either Gyeonggi or Gyeongsangnam Provinces in South Korea, sampling five animals from each province, respectively. Genomic DNA was extracted from blood samples in EDTA using a Wizard Genomic DNA kit (Promega, Madison, WI, USA). The genomic DNA of the Korean wild boars was extracted from meat samples. The quality and quantity of the DNA were evaluated using a Nano-Drop spectrophotometer (NanoDrop Technologies, USA) and gel electrophoresis in 1% agarose gel. The concentration of doublestranded DNA was checked using a Qubit dsDNA HS Assay (Invitrogen, USA). The National Institute of Animal Science's Institutional Animal Care and Use Committee reviewed and approved the study protocol and standard operating procedures (No. 2009-077, C-grade).  A_Coverage, assembly coverage calculated as the proportion of bases in the genome assembly that were covered by at least one read. b Ave_Fold, average fold that was calculated as the average depth of coverage across the whole genome.

Library construction and sequencing
( paired-end 100-bp reads). We generated 24,932,652,290 reads (2,518 Gb) from 55 samples. Each sample was sequenced to have over >34 Gb initial reads to produce a minimum of 10-fold genomic coverage.

Mapping, SNP calling, and annotation
The sequenced reads were mapped to the Swine reference genome assembly (Sscrofa10.2) using Burrows Wheeler Aligner version 0.5.9 with default options. 15 The genome sequence was downloaded from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/goldenPath/ susScr3/bigZips/susScr3.fa.gz). We also screened the unmapped reads and the reads that were aligned to unplaced scaffolds. Of the aligned reads, removal of polymerase chain reaction duplicates and resynchronization of the mate information were performed using Mark-Duplicates and FixMateInformation in the Picard software package version 1.48 (http://picard.sourceforge.net/), respectively. Local alignment around indels was performed to the duplication-removed reads using RealignerTargetCreator and IndelRealigner in the Genome Analysis Toolkit16 (GATK; version 1.0.5974). Furthermore, basepair quality scores were recalibrated using CountCovariates and TableRecalibration in the GATK. 16 Multi-sample SNP genotyping was performed to identify SNPs using UnifiedGenotyper in GATK. To reduce the false discovery rate, the filtering steps followed these criteria: QUAL < 30.0, QD < 2.0, MQ < 40.0, FS > 60.0, HaplotypeScore > 13.0, MQRankSum < −12.5, and ReadPosRankSum < −8.0. All the filtered SNPs were annotated to belong to 12 functional categories (

Phylogeny and admixture analysis
To perform phylogenetic and admixture analyses, haplotypes of all the animals in this study were inferred by an imputation tool, BEAGLE, version 3.3.2. 19 The phased haplotypes based on the SNPs were used to conduct phylogenetic analysis. The phylogenetic tree was generated using neighbour-joining tree estimation with pairwise distance, and Kimura's two-parameter distance method implemented in Analysis of Phylogenetic and Evolution (version 3.0.6) in R language. 20 The tree was plotted using MEGA5, 21 and the population structures were accessed by admixture analysis using the STRUCTURE version 2.3.4, which is based on the maximum likelihood method. 22 We used different values of K (number of putative ancestral clusters of allelic similarity) and an admixture model with a correlated allele frequency to assign the K clusters. 23 A 30,000 burn-in period of Chain Monte Carlo searches followed by 20,000 replicate runs were performed at each K from 2 to 5. The result was then plotted using DISTRUCT version 1.1. 24 The pairwise similarity between each animal was computed by the number of the same SNP genotypes divided by the total number of SNPs. The order of the 55 animals for the STRUCTURE plot and the similarity matrix followed the same order used in the phylogenetic tree.

Identification of selective sweeps
To identify putative selective sweeps, we searched genomic regions with higher degrees of fixation, following the method by Rubin et al. 13,25 The number of major and minor allele sequence reads were counted at each of the identified SNP in each breed. We filtered the SNP positions whose minor allele frequency was <0.05. We then scanned the genome using 50% overlapping windows of 150 kb in size, which was successfully adopted in the previous pig resequencing study. 25 Among the 150 kb windows, we excluded windows with fewer than 10 of the number of heterozygous SNPs in each breed. Briefly, the Pooled heterozygosity (Hp) was initially computed by an equation: 2*sum_maj*sum_min/(sum_maj + sum_min), 2 where sum_maj and sum_min are the sum of major and minor allele frequency at the given 150 kb window, using the numbers of reference and alternative allele sequence reads observed in each of the breeds. The ZHp score was calculated from Z transformation of the Hp score as described by Rubin et al. 13

Data access
All SNPs detected in this study have been submitted to the NCBI dbSNP with the accession numbers: ss1754731760-ss1774855430.

SNP detection
We identified a total of 20,123,573 SNPs throughout the genomes from all the 55 sequenced pigs of the five breeds in this study. The number of SNPs in each breed population varied from 6.6 to 14.0 million; the least number of the SNP set was observed in DUR, while the most values were found in KWB (Table 2). This was as expected, because the DUR was the breed used to construct the Sscrofa 10.2 reference assembly, and the KWB is the most genetically distant of the pig breeds in this study. 1,25 In addition, we do not rule out the possibility that parts of the SNPs could be caused by the within-population variation in this study. Of the total SNPs, ∼25.5% were found to be novel, as assessed by the dbSNP build 137 (Fig. 1a). Correspondingly, KWB showed the highest percentage of novel SNPs (24.9%) compared with   Table S1). These results well suggested that most of the SNPs identified in this study were reasonably accurate.

SNP annotation and gene of interest
All the detected SNPs were annotated functionally using Ensembl gene annotation and dbSNP databases, assigning them to 12 functional classes. In agreement with previous studies, most of the SNPs were located in intergenic and intron regions (74.1 and 24.6% respectively), while fewer SNPs (1.3%) resided within exonic region, including exonic, splice sites, and untranslated regions ( Fig. 1b and Table 2). Among the genic SNPs, we identified a substantial number (35,458) of nonsynonymous SNPs (nsSNPs) in a total of 9,904 genes, which could be associated with traits of interest in pigs (Supplementary  Table S2). To prioritize the nsSNPs, all the nsSNPs were further assigned a SIFT score to classify the effects of each nsSNP on phenotypic functions, based on the amino acid conservation. 29 According to the typical interpretation of the score, a score of ≤0.05 classifies an nsSNP as 'damaging' and a score of >0.05 corresponds to 'tolerant'. There were 6,849 damaging and 27,215 tolerant nsSNPs in our data (Supplementary Table S2). Furthermore, we extracted nsSNP sets that did not overlap among the breeds in this study (breed-specific nsSNP) and identified 126, 26, 11, 10, and 8 for KWB, KNP, DUR, LAN, and YOR, respectively (Supplementary Table S3).
Of the breed-specific nsSNP, we identified several interesting nsSNPs that may be implicated with economically important traits in the pig. For example, a YOR-specific novel nsSNP (SIFT score: 0.03; 6,916,234 bp on chromosome 2) was located in the phospholipase C, beta 3 gene (PLCB3). PLCB3 is involved in the mammalian oocyte at fertilization, spermatozoa activation, and embryonic lethality in sea urchin and mouse studies. [30][31][32] Notably, the Yorkshire (YOR) has been widely used as a maternal line breed, particularly for its large litter size; thus, the SNP in PLCB3 is a candidate SNP to account for genetic variation in the reproduction trait, whose genetic effects have been difficult to predict. Little is known about which genes affect litter size, which has a low heritability and is a highly appreciated trait in the pig industry. 33,34 A homozygous SNP was found in all 10 KWB individuals sequenced in this study (GG against the A reference allele). Interestingly, the same pattern was observed in the European wild boar samples, while the G allele frequency is around 33% in the Asian wild boars in a recent independent study (Dr Martien Groenen, personal communication). The gene affected by the SNP has homology to Neurexophilin and PC-esterase domain family, member 4 (NXPE4). Members of the NXPE family are neuropeptides that potentially signal via alpha-neurexins. 35 It could be hypothesized that the NXPE4 homolog is involved in neurological mechanisms associated with temperament or fear, whose genetic mechanisms are poorly understood in pigs. Although it is beyond the scope of this study to conclude how each of the genes are affected by the nsSNPs, the many damaging nsSNPs identified provide ample information to identify promising candidates for further studies to dissect genetic mechanisms of diverse traits in pigs.

Phylogenetic and admixture analysis
To examine the population structure and genetic relationships of the five breeds, we carried out multiple analyses, including phylogenetic analysis by neighbour-joining estimation, similarity estimation by comparison of genotypes, and admixture analysis using the STRUC-TURE program. All five breeds were consistently separated by those three methods, as expected. The wild boar (KWB) was mostly distinguished from the four domesticated breeds (KNP, DUR, LAN, and YOR), and a clear genetic distance was also observed within the domesticated breeds with the respect to each breed (Fig. 2a). The result was further supported by the similarity matrix, which showed genotyping similarity of <41% between all the KWBs and the domesticated breed individuals. KWB appears to be genetically closer to KNP than any other breed, as shown in both the phylogenetic tree and similarity matrix. The similarity level between KWB and KNP varied from 38.8 to 41%, while that between KWB and the other three breeds was 36.7-38.9%. We observed the maximum value of the genotyping similarity (∼70%) across the domestic breeds, while the minimum values within each breed were >70%, except for the KNP (minimum value = 68.1%). In the STRUCTURE analysis, we found the best-fit value of K as 5 to estimate the most likely number of genetic clusters, which corresponded to the number of breeds used in the study. While observing obvious genetic differentiation among the breeds, LAN and YOR have partially shared genetic differentiation signals; YOR shared 4-12.5% of LAN and LAN shared 8.3-31.5% of the YOR. Interestingly, among the 10 KNP individuals, six animals (Chookjin Chamdon, KNP Nos. 1-6) appeared to be genetically distant from the other four KNPs (Jeju native pig, KNP Nos. 7-10) that share other breed populations (11.3-14.6, 12.8-17.3, and 11.4-13.3% for DUR, LAN, and YOR, respectively). The result was found consistently in the phylogenetic tree, similarity matrix, and STRUCTURE analyses in this study. KNPs can be classified as two subgroups with respect to their geographical location in Korea, and they have been bred separately for ∼50 yrs. Despite their same origin until the last century, this result may imply that the Chookjin Chamdon and Jeju native have discernable genetic differences that may be caused by their geographical isolation and different environments. In addition, we did not rule out the possibility that the native breed may have been affected by unrecorded crossbreeding with imported breeds before the systematic management of the native genetic resources.

Selective sweep analysis
The whole genomes were scanned to identify genomic regions with excess homozygosity as indicative of a signature of selection in each breed. By applying 50% overlapping windows of 150 kb in size, 32,651 windows were used in the genome scanning. All the detected SNPs in each breed were used to calculate the ZHp. Altogether, 236, 472, 1,568, 383, and 363 windows were removed that had fewer than 10 Table S5 and S6). The CLDN1 genomic region was further examined by accessing pooled heterozygosity (Hp) calculated from each allele of the 142 SNPs that were identified in the region (Fig. 3c). Given that Hp scores of 0 and 0.5 indicate complete homozygosity and heterozygosity, respectively, three breeds other than the LAN and YOR showed distinctively lower homozygosity at the CLDN1. The CLDN1 is a dominant tight junction protein that plays a role in invasion activity during metastasis, and its biological functions are known to maintain pregnancy and facilitate conceptus attachment. 36,37 Note that, of the five breeds in this study, both LAN and YOR are outstanding maternal line breeds, with superior reproduction traits. In addition, a recent report showed that the Claudin family members are expressed more abundantly in Yorkshire pigs, as assessed using the porcine Affymetrix-Chip, suggesting that it may be implicated with successful conceptus attachment in this breed. 38 As another example, we found a signature of selection for a locus on Chromosome 9 (window #1296), which includes the twist family bHLH transcription factor 1 gene (TWIST1). The genomic region showed a distinctly lower ZHp of −3.02 in KNP (Supplementary  Table S5) compared with the other four breed populations (0.78, −1.17, 0.85, and 0.53 for KWB, DUR, LAN, and YOR, respectively). TWIST1 is implicated in cell lineage differentiation and is associated with breast cancer and Saethre-Chotzen syndrome in humans. 39,40 Furthermore, its expression was low in obese subjects and increased after weight loss, suggesting a prospective role in obesity. 41,42 Despite its lower growth rate and litter size, KNP is known to have the higher intramuscular fat content (marbling) than typical imported breeds. The marbling is an important palatability factor appealing to many Korean pork consumers; however, limited numbers of genes have been identified to account for genetic variations in the trait in KNP. These examples show the potential of our analysis to identify candidate genes to dissect genetic mechanisms implicated in economically important traits in diverse pig breeds (Fig. 4).

Conclusions
Here, we present extensive whole-genome resequencing analyses for 55 pigs of five breeds: Korean wild boar, Korean native, Duroc, Landrace, and Yorkshire. This study had two main findings. First, we identified a significant number of SNPs across the genomes (∼20.1 million), of which substantial numbers were novel. Furthermore, we located numerous nsSNPs using a deep annotation process, which could be candidate genetic markers for predicting genetic variations in traits of interest. Second, multiple methodologies were applied to the detected SNPs to dissect genomic features of the pig breeds, revealing obvious signals of genetic differentiation among the breeds. Furthermore, the whole genomes were scanned to detect signatures of selection by accessing excess homozygosity throughout the genome. We observed many higher homozygosity regions, which are unevenly distributed across the genome. The results allowed us to retrieve several interesting genomic regions that may be associated with economically important traits in pigs.