-
PDF
- Split View
-
Views
-
Cite
Cite
Chanwoo Kim, Young Jin Kim, Wanson Choi, Hye-Mi Jang, Mi Yeong Hwang, Sunwoo Jung, Hyunjoon Lim, Sang Bin Hong, Kyungheon Yoon, Bong-Jo Kim, Hyun-Young Park, Buhm Han, Phenome-wide association study of the major histocompatibility complex region in the Korean population identifies novel association signals, Human Molecular Genetics, Volume 31, Issue 15, 1 August 2022, Pages 2655–2667, https://doi.org/10.1093/hmg/ddac016
- Share Icon Share
Abstract
Human leukocyte antigen (HLA) gene variants in the major histocompatibility complex (MHC) region are associated with numerous complex human diseases and quantitative traits. Previous phenome-wide association studies (PheWAS) for this region demonstrated that HLA association patterns to the phenome have both population-specific and population-shared components. We performed MHC PheWAS in the Korean population by analyzing associations between phenotypes and genetic variants in the MHC region using the Korea Biobank Array project data samples from the Korean Genome and Epidemiology Study cohorts. Using this single-population dataset, we curated and analyzed 82 phenotypes for 125 673 Korean individuals after imputing HLA using CookHLA, a recently developed imputation framework. More than one-third of these phenotypes showed significant associations, confirming 56 known associations and discovering 13 novel association signals that were not reported previously. In addition, we analyzed heritability explained by the variants in the MHC region and genetic correlations among phenotypes based on the MHC variants.
Introduction
Genetic variants in major histocompatibility complex (MHC) region located at 6p21.3 show significant associations with many complex traits. Although this region occupies only 0.13% of the human genome, it contains more than 200 genes and is known to be associated with numerous complex human diseases and quantitative traits (1–3). Among the genes in this region, human leukocyte antigen (HLA) genes are the key determinants of susceptibility and resistance to many inflammatory and autoimmune diseases such as type 1 diabetes (4,5), Crohn’s disease (6) and asthma (7). Due to the critical roles of HLA, it has been of utmost interest which HLA alleles and amino acids are driving the diseases. To find this answer, previous investigations have utilized the genome-wide association study (GWAS) data of each phenotype and fine-mapped variants in the MHC region via the imputation technology. These efforts led to the identification of many HLA variants associated with numerous immune-mediated phenotypes (8,9).
MHC is characterized by its cross-phenotype associations. Therefore, although many previous HLA fine-mapping studies were conducted based on a single phenotype, phenome-wide association studies (PheWAS) have the potential to identify new associations in this region (5,10,11). Recently, several researchers have conducted PheWAS on the MHC region. Karnes et al. applied PheWAS to 1368 phenotypes for 37 270 European individuals to assess the genetic risk of HLA variants (4). More recently, Hirata et al. applied PheWAS to 106 phenotypes for 116 190 Japanese individuals using population-specific deep whole-genome sequencing data (12). These studies demonstrated that HLA association patterns to the phenome can have both population-specific and population-shared components. Thus, to gain a full understanding of the genetic architecture of the MHC region, PheWAS analyses in diverse populations with an expansive set of phenotypes will be required.
In this study, we performed a comprehensive MHC PheWAS analysis in the Korean population using phenotypic and genotypic information of 125 673 Korean individuals (13). We used the Korea Biobank Array (KBA) project data samples from the Korean Genome and Epidemiology Study (KoGES) cohorts. These individuals were genotyped using the KBA, a customized microarray designed for the Korean population (14). The phenotypes included answers to an extensive set of questionnaires along with data from physical examinations and clinical investigations collected during the baseline and follow-up phases.
Using this single-population dataset, we wanted to both confirm previously reported associations and discover novel associations between HLA variants and phenotypes. To this end, we curated 82 overlapping phenotypes across three cohorts in the KoGES data. We imputed genotypes of eight classical HLA genes: HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1 and HLA-DPB1, using CookHLA, a recently developed imputation framework (15). In total, we obtained 9097 HLA binary markers that represent 119 four-digit alleles and 349 amino acid residues for the eight HLA genes. In addition to the markers for the HLA genes, we utilized a total of 61 890 single-nucleotide variants (SNVs) and indels in the extended MHC region (28–35 Mb), which we expect to tag other non-HLA and non-classical HLA genes. Our PheWAS analysis confirmed 56 known associations and discovered 13 novel association signals that were not reported previously. In addition, we analyzed heritability explained by the MHC variants for each phenotype and genetic correlations among phenotypes based on the MHC variants.
Results
Single-population PheWAS identifies novel association patterns in the MHC region
To comprehensively investigate the influence of HLA variants on complex diseases and quantitative traits in Koreans, we extracted 82 phenotypes (47 binary and 35 continuous phenotypes) from questionnaires and examination items in the KoGES data (13) (Methods, Supplementary Material, Table S1). The KBA, a microarray designed for Korean population, was built to achieve a high SNV density in MHC (5292 SNVs in chr6: 29–34 Mb) (14). Based on these SNVs, we imputed HLA using CookHLA (15) with a merged HLA reference panel (16) (N = 10 873) from the Chinese reference panel (17) and the Pan-Asian panel (18,19) (Supplementary Material, Tables S2 and S3). We then used HLA Analysis ToolKit (20) to define binary markers that correspond to the presence or absence of HLA alleles, amino acid residues and exonic SNV alleles in the MHC region. In addition to these markers, we imputed 56 598 known SNVs and indels across the extended MHC region using the IMPUTE4 software (21) with a merged reference panel of 1000 Genomes project phase 3 (22) and Korean Reference Genome (14). In total, we used 70 987 variants (5292 KBA genotyped SNVs, 9097 imputed HLA binary markers and 56 598 imputed SNVs and indels).
Using these variants, we conducted PheWAS with 82 phenotypes. Among the 82 phenotypes, 33 (6 binary and 27 continuous) phenotypes showed an association signal that satisfied the genome-wide significance threshold of P < 5.0 × 10−8 (Table 1). Thus, more than one-third of phenotypes analyzed showed significant MHC associations. This observation was consistent with previous studies that found pleiotropic cross-phenotype associations in the MHC region (12).
Association signal of P-values above significance threshold identified by PheWAS. Allele frequencies were omitted for omnibus test results
Phenotype . | N (case/control) . | Variant . | Position (GRCh37) . | A1/A2 . | Gene . | How assigned . | A1 freq (cases, controls) . | Coef (std) . | P . |
---|---|---|---|---|---|---|---|---|---|
Cardiovascular disease | |||||||||
Hypertension | 25 614/78 950 | HLA-B pos.80 | 31 324 497 | T/others | HLA-B | gene | 0.225, 0.211 | 0.095 (0.0130) | 2.26E−13 |
Immune-related disease | |||||||||
Allergic disease | 9273/78 950 | rs9274326 | 32 632 237 | C/T | HLA-DQB1 | gene | 0.450, 0.428 | 0.089 (0.0158) | 2.10E−08 |
Kidney-related disease | |||||||||
Hematuria | 1817/14 938 | rs2517754 | 29 896 680 | A/G | HLA-A | LD | 0.384, 0.433 | −0.201 (0.0364) | 3.09E−08 |
Metabolic disease | |||||||||
Hyperlipidemia | 11 043/78 950 | rs12661142 | 32 639 786 | A/G | HLA-DQB1 | LD | 0.066, 0.052 | 0.262 (0.0301) | 3.23E−18 |
Type 2 diabetes | 8662/78 950 | rs2797963 | 34 185 749 | C/T | HMGA1 | LD | 0.136, 0.156 | −0.165 (0.0241) | 7.40E−12 |
Thyroid-related disease | |||||||||
Thyroid disease | 5139/75 944 | 6:32608334 | 32 608 334 | G/GAAA | HLA-DQA1 | gene | 0.352, 0.399 | −0.206 (0.0215) | 9.95E−22 |
Anthropometric QTL | |||||||||
Body mass index | 124 539 | HLA-DQA1 pos.69 | 32 609 285 | L/others | HLA-DQA1 | gene | 0.37 | −0.071 (0.0116) | 1.18E−09 |
Grip strength | 59 747 | rs2301748 | 31 371 397 | T/C | MICA | gene | 0.058 | 0.401 (0.0734) | 4.61E−08 |
Height | 125 264 | 6:34207474 | 34 207 474 | T/TAA | HMGA1 | gene | 0.161 | 0.525 (0.0283) | 2.05E−76 |
Hip circumference | 124 153 | rs78420081 | 34 231 005 | C/T | HMGA1 | LD | 0.125 | 0.361 (0.0337) | 8.74E−27 |
Body Weight | 124 693 | 6:34205465 | 34 205 465 | G/GGAGCCC | HMGA1 | gene | 0.163 | 0.553 (0.0441) | 5.13E−36 |
Blood-pressure QTL | |||||||||
Diastolic blood pressure | 124 766 | rs2071277 | 32 171 683 | C/T | NOTCH4 | gene | 0.423 | 0.261 (0.0392) | 2.87E−11 |
Systolic blood pressure | 124 467 | rs2071277 | 32 171 683 | C/T | NOTCH4 | gene | 0.423 | 0.370 (0.0585) | 2.51E−10 |
Hematological QTL | |||||||||
Hematocrit | 87 158 | rs9391855 | 32 149 801 | T/C | AGER | gene | 0.166 | −0.134 (0.0187) | 9.20E−13 |
Hemoglobin | 120 831 | rs9394078 | 31 868 938 | T/C | C2 | gene | 0.088 | −0.058 (0.0076) | 1.54E−14 |
Mean corpuscular hemoglobin | 77 812 | HLA-C pos.114 | 31 239 060 | N/D | HLA-C | gene | 0.172 | 0.083 (0.0100) | 1.11E−16 |
Mean corpuscular volume | 77 977 | HLA-C pos.275 | 31 237 869 | E/G/K | HLA-C | gene | – | – | 1.30E−18 |
Platelet count | 86 087 | rs9296095 | 33 542 523 | CT | BAK1 | gene | 0.223 | 5.566 (0.3155) | 1.58E−69 |
Red blood cell count | 87 356 | HLA-B pos.69 | 31 324 530 | A/R/T | HLA-B | gene | – | – | 1.49E−22 |
White blood cell count | 86 639 | rs17886232 | 31 237 965 | G/C | HLA-C | gene | 0.358 | 0.119 (0.0075) | 1.57E−56 |
Kidney-related QTL | |||||||||
Blood urea nitrogen | 124 563 | rs3095337 | 30 737 591 | C/G | HLA-E | LD | 0.073 | −0.186 (0.0269) | 4.12E−12 |
Creatinine | 125 191 | rs2301748 | 31 371 397 | T/C | MICA | gene | 0.057 | 0.012 (0.0011) | 1.04E−31 |
Liver-related QTL | |||||||||
Alanine aminotransferase | 120 788 | rs2905715 | 31 351 530 | C/T | MICA | LD | 0.272 | 0.327 (0.0501) | 6.92E−11 |
Aspartate transaminase | 121 162 | rs117654833 | 32 202 630 | T/G | HLA-DRB1 | LD | 0.072 | 0.444 (0.0584) | 2.92E−14 |
Metabolic QTL | |||||||||
Hemoglobin A1c | 56 650 | rs2923009 | 31 358 559 | T/C | MICA | LD | 0.24 | −0.021 (0.0033) | 5.25E−10 |
HDL cholesterol | 124 772 | rs9380294 | 32 380 909 | C/A | BTNL2 | nearest | 0.146 | 0.503 (0.0653) | 1.43E−14 |
Total cholesterol | 125 007 | HLA-DQB1 pos.57 | 32 632 681 | V/others | HLA-DQB1 | gene | 0.178 | 2.168 (0.1793) | 1.19E−33 |
Triglyceride | 123 426 | rs111524667 | 32 453 284 | G/C | HLA-DQA1 | LD | 0.09 | 6.680 (0.4653) | 1.08E−46 |
Protein QTL | |||||||||
Albumin | 125 069 | rs7762933 | 31 162 520 | T/C | HLA-C | LD | 0.447 | −0.009 (0.0010) | 1.18E−17 |
High-sensitivity C-reactive protein | 99 904 | rs915654 | 31 538 497 | T/A | MICB | LD | 0.447 | −0.004 (0.0007) | 5.73E−09 |
Protein in blood | 24 307 | 6:32324081 | 32 324 081 | AT/A | C6orf10 | gene | 0.139 | 0.030 (0.0054) | 2.62E−08 |
Other QTL | |||||||||
Menopause age | 44 843 | 6:32547722 | 32 547 722 | T/others | HLA-DRB1 | gene | 0.411 | 0.177 (0.0299) | 3.08E−09 |
Pulse rate | 121 124 | rs2534791 | 29 522 712 | G/A | UBD | LD | 0.174 | 0.254 (0.0459) | 3.10E−08 |
Phenotype . | N (case/control) . | Variant . | Position (GRCh37) . | A1/A2 . | Gene . | How assigned . | A1 freq (cases, controls) . | Coef (std) . | P . |
---|---|---|---|---|---|---|---|---|---|
Cardiovascular disease | |||||||||
Hypertension | 25 614/78 950 | HLA-B pos.80 | 31 324 497 | T/others | HLA-B | gene | 0.225, 0.211 | 0.095 (0.0130) | 2.26E−13 |
Immune-related disease | |||||||||
Allergic disease | 9273/78 950 | rs9274326 | 32 632 237 | C/T | HLA-DQB1 | gene | 0.450, 0.428 | 0.089 (0.0158) | 2.10E−08 |
Kidney-related disease | |||||||||
Hematuria | 1817/14 938 | rs2517754 | 29 896 680 | A/G | HLA-A | LD | 0.384, 0.433 | −0.201 (0.0364) | 3.09E−08 |
Metabolic disease | |||||||||
Hyperlipidemia | 11 043/78 950 | rs12661142 | 32 639 786 | A/G | HLA-DQB1 | LD | 0.066, 0.052 | 0.262 (0.0301) | 3.23E−18 |
Type 2 diabetes | 8662/78 950 | rs2797963 | 34 185 749 | C/T | HMGA1 | LD | 0.136, 0.156 | −0.165 (0.0241) | 7.40E−12 |
Thyroid-related disease | |||||||||
Thyroid disease | 5139/75 944 | 6:32608334 | 32 608 334 | G/GAAA | HLA-DQA1 | gene | 0.352, 0.399 | −0.206 (0.0215) | 9.95E−22 |
Anthropometric QTL | |||||||||
Body mass index | 124 539 | HLA-DQA1 pos.69 | 32 609 285 | L/others | HLA-DQA1 | gene | 0.37 | −0.071 (0.0116) | 1.18E−09 |
Grip strength | 59 747 | rs2301748 | 31 371 397 | T/C | MICA | gene | 0.058 | 0.401 (0.0734) | 4.61E−08 |
Height | 125 264 | 6:34207474 | 34 207 474 | T/TAA | HMGA1 | gene | 0.161 | 0.525 (0.0283) | 2.05E−76 |
Hip circumference | 124 153 | rs78420081 | 34 231 005 | C/T | HMGA1 | LD | 0.125 | 0.361 (0.0337) | 8.74E−27 |
Body Weight | 124 693 | 6:34205465 | 34 205 465 | G/GGAGCCC | HMGA1 | gene | 0.163 | 0.553 (0.0441) | 5.13E−36 |
Blood-pressure QTL | |||||||||
Diastolic blood pressure | 124 766 | rs2071277 | 32 171 683 | C/T | NOTCH4 | gene | 0.423 | 0.261 (0.0392) | 2.87E−11 |
Systolic blood pressure | 124 467 | rs2071277 | 32 171 683 | C/T | NOTCH4 | gene | 0.423 | 0.370 (0.0585) | 2.51E−10 |
Hematological QTL | |||||||||
Hematocrit | 87 158 | rs9391855 | 32 149 801 | T/C | AGER | gene | 0.166 | −0.134 (0.0187) | 9.20E−13 |
Hemoglobin | 120 831 | rs9394078 | 31 868 938 | T/C | C2 | gene | 0.088 | −0.058 (0.0076) | 1.54E−14 |
Mean corpuscular hemoglobin | 77 812 | HLA-C pos.114 | 31 239 060 | N/D | HLA-C | gene | 0.172 | 0.083 (0.0100) | 1.11E−16 |
Mean corpuscular volume | 77 977 | HLA-C pos.275 | 31 237 869 | E/G/K | HLA-C | gene | – | – | 1.30E−18 |
Platelet count | 86 087 | rs9296095 | 33 542 523 | CT | BAK1 | gene | 0.223 | 5.566 (0.3155) | 1.58E−69 |
Red blood cell count | 87 356 | HLA-B pos.69 | 31 324 530 | A/R/T | HLA-B | gene | – | – | 1.49E−22 |
White blood cell count | 86 639 | rs17886232 | 31 237 965 | G/C | HLA-C | gene | 0.358 | 0.119 (0.0075) | 1.57E−56 |
Kidney-related QTL | |||||||||
Blood urea nitrogen | 124 563 | rs3095337 | 30 737 591 | C/G | HLA-E | LD | 0.073 | −0.186 (0.0269) | 4.12E−12 |
Creatinine | 125 191 | rs2301748 | 31 371 397 | T/C | MICA | gene | 0.057 | 0.012 (0.0011) | 1.04E−31 |
Liver-related QTL | |||||||||
Alanine aminotransferase | 120 788 | rs2905715 | 31 351 530 | C/T | MICA | LD | 0.272 | 0.327 (0.0501) | 6.92E−11 |
Aspartate transaminase | 121 162 | rs117654833 | 32 202 630 | T/G | HLA-DRB1 | LD | 0.072 | 0.444 (0.0584) | 2.92E−14 |
Metabolic QTL | |||||||||
Hemoglobin A1c | 56 650 | rs2923009 | 31 358 559 | T/C | MICA | LD | 0.24 | −0.021 (0.0033) | 5.25E−10 |
HDL cholesterol | 124 772 | rs9380294 | 32 380 909 | C/A | BTNL2 | nearest | 0.146 | 0.503 (0.0653) | 1.43E−14 |
Total cholesterol | 125 007 | HLA-DQB1 pos.57 | 32 632 681 | V/others | HLA-DQB1 | gene | 0.178 | 2.168 (0.1793) | 1.19E−33 |
Triglyceride | 123 426 | rs111524667 | 32 453 284 | G/C | HLA-DQA1 | LD | 0.09 | 6.680 (0.4653) | 1.08E−46 |
Protein QTL | |||||||||
Albumin | 125 069 | rs7762933 | 31 162 520 | T/C | HLA-C | LD | 0.447 | −0.009 (0.0010) | 1.18E−17 |
High-sensitivity C-reactive protein | 99 904 | rs915654 | 31 538 497 | T/A | MICB | LD | 0.447 | −0.004 (0.0007) | 5.73E−09 |
Protein in blood | 24 307 | 6:32324081 | 32 324 081 | AT/A | C6orf10 | gene | 0.139 | 0.030 (0.0054) | 2.62E−08 |
Other QTL | |||||||||
Menopause age | 44 843 | 6:32547722 | 32 547 722 | T/others | HLA-DRB1 | gene | 0.411 | 0.177 (0.0299) | 3.08E−09 |
Pulse rate | 121 124 | rs2534791 | 29 522 712 | G/A | UBD | LD | 0.174 | 0.254 (0.0459) | 3.10E−08 |
Association signal of P-values above significance threshold identified by PheWAS. Allele frequencies were omitted for omnibus test results
Phenotype . | N (case/control) . | Variant . | Position (GRCh37) . | A1/A2 . | Gene . | How assigned . | A1 freq (cases, controls) . | Coef (std) . | P . |
---|---|---|---|---|---|---|---|---|---|
Cardiovascular disease | |||||||||
Hypertension | 25 614/78 950 | HLA-B pos.80 | 31 324 497 | T/others | HLA-B | gene | 0.225, 0.211 | 0.095 (0.0130) | 2.26E−13 |
Immune-related disease | |||||||||
Allergic disease | 9273/78 950 | rs9274326 | 32 632 237 | C/T | HLA-DQB1 | gene | 0.450, 0.428 | 0.089 (0.0158) | 2.10E−08 |
Kidney-related disease | |||||||||
Hematuria | 1817/14 938 | rs2517754 | 29 896 680 | A/G | HLA-A | LD | 0.384, 0.433 | −0.201 (0.0364) | 3.09E−08 |
Metabolic disease | |||||||||
Hyperlipidemia | 11 043/78 950 | rs12661142 | 32 639 786 | A/G | HLA-DQB1 | LD | 0.066, 0.052 | 0.262 (0.0301) | 3.23E−18 |
Type 2 diabetes | 8662/78 950 | rs2797963 | 34 185 749 | C/T | HMGA1 | LD | 0.136, 0.156 | −0.165 (0.0241) | 7.40E−12 |
Thyroid-related disease | |||||||||
Thyroid disease | 5139/75 944 | 6:32608334 | 32 608 334 | G/GAAA | HLA-DQA1 | gene | 0.352, 0.399 | −0.206 (0.0215) | 9.95E−22 |
Anthropometric QTL | |||||||||
Body mass index | 124 539 | HLA-DQA1 pos.69 | 32 609 285 | L/others | HLA-DQA1 | gene | 0.37 | −0.071 (0.0116) | 1.18E−09 |
Grip strength | 59 747 | rs2301748 | 31 371 397 | T/C | MICA | gene | 0.058 | 0.401 (0.0734) | 4.61E−08 |
Height | 125 264 | 6:34207474 | 34 207 474 | T/TAA | HMGA1 | gene | 0.161 | 0.525 (0.0283) | 2.05E−76 |
Hip circumference | 124 153 | rs78420081 | 34 231 005 | C/T | HMGA1 | LD | 0.125 | 0.361 (0.0337) | 8.74E−27 |
Body Weight | 124 693 | 6:34205465 | 34 205 465 | G/GGAGCCC | HMGA1 | gene | 0.163 | 0.553 (0.0441) | 5.13E−36 |
Blood-pressure QTL | |||||||||
Diastolic blood pressure | 124 766 | rs2071277 | 32 171 683 | C/T | NOTCH4 | gene | 0.423 | 0.261 (0.0392) | 2.87E−11 |
Systolic blood pressure | 124 467 | rs2071277 | 32 171 683 | C/T | NOTCH4 | gene | 0.423 | 0.370 (0.0585) | 2.51E−10 |
Hematological QTL | |||||||||
Hematocrit | 87 158 | rs9391855 | 32 149 801 | T/C | AGER | gene | 0.166 | −0.134 (0.0187) | 9.20E−13 |
Hemoglobin | 120 831 | rs9394078 | 31 868 938 | T/C | C2 | gene | 0.088 | −0.058 (0.0076) | 1.54E−14 |
Mean corpuscular hemoglobin | 77 812 | HLA-C pos.114 | 31 239 060 | N/D | HLA-C | gene | 0.172 | 0.083 (0.0100) | 1.11E−16 |
Mean corpuscular volume | 77 977 | HLA-C pos.275 | 31 237 869 | E/G/K | HLA-C | gene | – | – | 1.30E−18 |
Platelet count | 86 087 | rs9296095 | 33 542 523 | CT | BAK1 | gene | 0.223 | 5.566 (0.3155) | 1.58E−69 |
Red blood cell count | 87 356 | HLA-B pos.69 | 31 324 530 | A/R/T | HLA-B | gene | – | – | 1.49E−22 |
White blood cell count | 86 639 | rs17886232 | 31 237 965 | G/C | HLA-C | gene | 0.358 | 0.119 (0.0075) | 1.57E−56 |
Kidney-related QTL | |||||||||
Blood urea nitrogen | 124 563 | rs3095337 | 30 737 591 | C/G | HLA-E | LD | 0.073 | −0.186 (0.0269) | 4.12E−12 |
Creatinine | 125 191 | rs2301748 | 31 371 397 | T/C | MICA | gene | 0.057 | 0.012 (0.0011) | 1.04E−31 |
Liver-related QTL | |||||||||
Alanine aminotransferase | 120 788 | rs2905715 | 31 351 530 | C/T | MICA | LD | 0.272 | 0.327 (0.0501) | 6.92E−11 |
Aspartate transaminase | 121 162 | rs117654833 | 32 202 630 | T/G | HLA-DRB1 | LD | 0.072 | 0.444 (0.0584) | 2.92E−14 |
Metabolic QTL | |||||||||
Hemoglobin A1c | 56 650 | rs2923009 | 31 358 559 | T/C | MICA | LD | 0.24 | −0.021 (0.0033) | 5.25E−10 |
HDL cholesterol | 124 772 | rs9380294 | 32 380 909 | C/A | BTNL2 | nearest | 0.146 | 0.503 (0.0653) | 1.43E−14 |
Total cholesterol | 125 007 | HLA-DQB1 pos.57 | 32 632 681 | V/others | HLA-DQB1 | gene | 0.178 | 2.168 (0.1793) | 1.19E−33 |
Triglyceride | 123 426 | rs111524667 | 32 453 284 | G/C | HLA-DQA1 | LD | 0.09 | 6.680 (0.4653) | 1.08E−46 |
Protein QTL | |||||||||
Albumin | 125 069 | rs7762933 | 31 162 520 | T/C | HLA-C | LD | 0.447 | −0.009 (0.0010) | 1.18E−17 |
High-sensitivity C-reactive protein | 99 904 | rs915654 | 31 538 497 | T/A | MICB | LD | 0.447 | −0.004 (0.0007) | 5.73E−09 |
Protein in blood | 24 307 | 6:32324081 | 32 324 081 | AT/A | C6orf10 | gene | 0.139 | 0.030 (0.0054) | 2.62E−08 |
Other QTL | |||||||||
Menopause age | 44 843 | 6:32547722 | 32 547 722 | T/others | HLA-DRB1 | gene | 0.411 | 0.177 (0.0299) | 3.08E−09 |
Pulse rate | 121 124 | rs2534791 | 29 522 712 | G/A | UBD | LD | 0.174 | 0.254 (0.0459) | 3.10E−08 |
Phenotype . | N (case/control) . | Variant . | Position (GRCh37) . | A1/A2 . | Gene . | How assigned . | A1 freq (cases, controls) . | Coef (std) . | P . |
---|---|---|---|---|---|---|---|---|---|
Cardiovascular disease | |||||||||
Hypertension | 25 614/78 950 | HLA-B pos.80 | 31 324 497 | T/others | HLA-B | gene | 0.225, 0.211 | 0.095 (0.0130) | 2.26E−13 |
Immune-related disease | |||||||||
Allergic disease | 9273/78 950 | rs9274326 | 32 632 237 | C/T | HLA-DQB1 | gene | 0.450, 0.428 | 0.089 (0.0158) | 2.10E−08 |
Kidney-related disease | |||||||||
Hematuria | 1817/14 938 | rs2517754 | 29 896 680 | A/G | HLA-A | LD | 0.384, 0.433 | −0.201 (0.0364) | 3.09E−08 |
Metabolic disease | |||||||||
Hyperlipidemia | 11 043/78 950 | rs12661142 | 32 639 786 | A/G | HLA-DQB1 | LD | 0.066, 0.052 | 0.262 (0.0301) | 3.23E−18 |
Type 2 diabetes | 8662/78 950 | rs2797963 | 34 185 749 | C/T | HMGA1 | LD | 0.136, 0.156 | −0.165 (0.0241) | 7.40E−12 |
Thyroid-related disease | |||||||||
Thyroid disease | 5139/75 944 | 6:32608334 | 32 608 334 | G/GAAA | HLA-DQA1 | gene | 0.352, 0.399 | −0.206 (0.0215) | 9.95E−22 |
Anthropometric QTL | |||||||||
Body mass index | 124 539 | HLA-DQA1 pos.69 | 32 609 285 | L/others | HLA-DQA1 | gene | 0.37 | −0.071 (0.0116) | 1.18E−09 |
Grip strength | 59 747 | rs2301748 | 31 371 397 | T/C | MICA | gene | 0.058 | 0.401 (0.0734) | 4.61E−08 |
Height | 125 264 | 6:34207474 | 34 207 474 | T/TAA | HMGA1 | gene | 0.161 | 0.525 (0.0283) | 2.05E−76 |
Hip circumference | 124 153 | rs78420081 | 34 231 005 | C/T | HMGA1 | LD | 0.125 | 0.361 (0.0337) | 8.74E−27 |
Body Weight | 124 693 | 6:34205465 | 34 205 465 | G/GGAGCCC | HMGA1 | gene | 0.163 | 0.553 (0.0441) | 5.13E−36 |
Blood-pressure QTL | |||||||||
Diastolic blood pressure | 124 766 | rs2071277 | 32 171 683 | C/T | NOTCH4 | gene | 0.423 | 0.261 (0.0392) | 2.87E−11 |
Systolic blood pressure | 124 467 | rs2071277 | 32 171 683 | C/T | NOTCH4 | gene | 0.423 | 0.370 (0.0585) | 2.51E−10 |
Hematological QTL | |||||||||
Hematocrit | 87 158 | rs9391855 | 32 149 801 | T/C | AGER | gene | 0.166 | −0.134 (0.0187) | 9.20E−13 |
Hemoglobin | 120 831 | rs9394078 | 31 868 938 | T/C | C2 | gene | 0.088 | −0.058 (0.0076) | 1.54E−14 |
Mean corpuscular hemoglobin | 77 812 | HLA-C pos.114 | 31 239 060 | N/D | HLA-C | gene | 0.172 | 0.083 (0.0100) | 1.11E−16 |
Mean corpuscular volume | 77 977 | HLA-C pos.275 | 31 237 869 | E/G/K | HLA-C | gene | – | – | 1.30E−18 |
Platelet count | 86 087 | rs9296095 | 33 542 523 | CT | BAK1 | gene | 0.223 | 5.566 (0.3155) | 1.58E−69 |
Red blood cell count | 87 356 | HLA-B pos.69 | 31 324 530 | A/R/T | HLA-B | gene | – | – | 1.49E−22 |
White blood cell count | 86 639 | rs17886232 | 31 237 965 | G/C | HLA-C | gene | 0.358 | 0.119 (0.0075) | 1.57E−56 |
Kidney-related QTL | |||||||||
Blood urea nitrogen | 124 563 | rs3095337 | 30 737 591 | C/G | HLA-E | LD | 0.073 | −0.186 (0.0269) | 4.12E−12 |
Creatinine | 125 191 | rs2301748 | 31 371 397 | T/C | MICA | gene | 0.057 | 0.012 (0.0011) | 1.04E−31 |
Liver-related QTL | |||||||||
Alanine aminotransferase | 120 788 | rs2905715 | 31 351 530 | C/T | MICA | LD | 0.272 | 0.327 (0.0501) | 6.92E−11 |
Aspartate transaminase | 121 162 | rs117654833 | 32 202 630 | T/G | HLA-DRB1 | LD | 0.072 | 0.444 (0.0584) | 2.92E−14 |
Metabolic QTL | |||||||||
Hemoglobin A1c | 56 650 | rs2923009 | 31 358 559 | T/C | MICA | LD | 0.24 | −0.021 (0.0033) | 5.25E−10 |
HDL cholesterol | 124 772 | rs9380294 | 32 380 909 | C/A | BTNL2 | nearest | 0.146 | 0.503 (0.0653) | 1.43E−14 |
Total cholesterol | 125 007 | HLA-DQB1 pos.57 | 32 632 681 | V/others | HLA-DQB1 | gene | 0.178 | 2.168 (0.1793) | 1.19E−33 |
Triglyceride | 123 426 | rs111524667 | 32 453 284 | G/C | HLA-DQA1 | LD | 0.09 | 6.680 (0.4653) | 1.08E−46 |
Protein QTL | |||||||||
Albumin | 125 069 | rs7762933 | 31 162 520 | T/C | HLA-C | LD | 0.447 | −0.009 (0.0010) | 1.18E−17 |
High-sensitivity C-reactive protein | 99 904 | rs915654 | 31 538 497 | T/A | MICB | LD | 0.447 | −0.004 (0.0007) | 5.73E−09 |
Protein in blood | 24 307 | 6:32324081 | 32 324 081 | AT/A | C6orf10 | gene | 0.139 | 0.030 (0.0054) | 2.62E−08 |
Other QTL | |||||||||
Menopause age | 44 843 | 6:32547722 | 32 547 722 | T/others | HLA-DRB1 | gene | 0.411 | 0.177 (0.0299) | 3.08E−09 |
Pulse rate | 121 124 | rs2534791 | 29 522 712 | G/A | UBD | LD | 0.174 | 0.254 (0.0459) | 3.10E−08 |
To find secondary association signals, we further conducted stepwise conditional analysis for each phenotype. After conditioning on the top signal, we were able to identify 36 additional signals independent from the top signals (Supplementary Material, Table S4). The average number of independent associations for each phenotype in the stepwise analysis was 2.1, which was in line with the previous finding (12). Platelet count and height showed the largest numbers of independent association signals, seven and five signals, respectively.
To assess the novelty of our findings, we compared our result with previously reported associations using the NHGRI-EBI GWAS Catalog (23). For each association consisting of a phenotype and a variant, we checked if any variants within 500 kb from the variant were previously reported for the same phenotype (Supplementary Material, Table S5). If no such previous report was found in the GWAS Catalog, we defined the association novel. As a result, we found that 13 association signals were novel among the 69 independent associations signals (Fig. 1). These novel associations emphasized the utility of conducting a large-scale PheWAS in the MHC region for discovering population-specific association signals in a homogeneous population. In the following, we describe our findings in each of the 13 phenotypic categories for which we found associations.

Matrix of phenotype and gene assigned to association signal. Significant associations between phenotype and gene are shown. In addition to the top association signal of each phenotype, we indicate further independent associations found by conditional analysis. In the case of multiple association signals mapped to a single gene, only the most significant signal is indicated. The novel association signals are marked with thick black edges. The thin bars at the bottom show the total number of association signals per gene.
Cardiovascular diseases
We tested MHC associations for five cardiovascular diseases in the KoGES data: angina pectoris/myocardial infarction, heart failure, hypertension, stroke and transient ischemic attack. Among them, we observed an association between hypertension and HLA-B amino acid position 80 (T residue vs others, P = 2.26 × 10−13). This observation confirmed a previous report that an SNV (rs2524095) proximal to this amino acid position (58.4 kb) was associated with hypertension (P = 1.00 × 10−9) (24). We note that although one of our cohorts was named cardiovascular disease association study (CAVAS), it was a population-based cohort contrary to its name, and the allele frequency of this amino acid residue in this cohort (0.220 in cases and 0.206 in controls) was similar to the other two cohorts, HEXA (0.227 in cases and 0.211 in controls) and the community-based cohort (0.226 in cases and 0.218 in controls).
Immune-related diseases
Testing MHC associations in two immune-related diseases in the KoGES data, allergic disease and asthma, revealed an association between allergic disease and an SNV (rs9274326) within HLA-DQB1 gene (P = 2.10 × 10−8). This observation was consistent with previous studies (7,25–29), which reported several SNVs proximal to HLA-DQB1 gene to be associated with many allergic diseases including self-reported allergies, peanut allergy, hydrolyzed wheat protein allergy, allergic rhinitis, asthma, hay fever and eczema. Due to the broad phenotypic definition of allergy in the KoGES data, we were not able to identify which specific allergies were driving the signal.
Kidney-related diseases
Two kidney-related diseases in the KoGES data, hematuria and renal failure, were tested for MHC associations. We observed an association between hematuria and an SNV (rs2517754, P = 3.09 × 10−8) which was proximal to HLA-A gene (12.4 kb) and located in HLA-K gene (pseudogene). We did not find any previous studies that reported significant associations within 500 kb from this SNV.
Metabolic diseases
We tested for MHC associations in two metabolic diseases in the KoGES data: hyperlipidemia and type 2 diabetes. We observed an association between hyperlipidemia and an SNV (rs12661142, P = 3.23 × 10−18) located near HLA-DQB1 gene (12.5 kb). This observation was consistent with previous studies that reported associations between hyperlipidemia and SNVs located in HLA-DQB1 gene (30–35). We observed an association between type 2 diabetes and an SNV (rs2797963, P = 7.40 × 10−12) close to HMGA1 gene (18.9 kb). This observation was consistent with a previous study that reported an SNV (rs4711389) proximal to the SNV (28.9 kb) to be associated with type 2 diabetes (36) (P = 7.00 × 10−17).
Thyroid-related diseases
The questionnaire for KoGES included a general question asking whether the participant has a history of any thyroid-related disease. We observed four independent associations for thyroid disease in the MHC region. This observation replicated results from previous studies (37–39), which reported associated loci within 500 kb from our signals (40).
Anthropometric QTL
Six anthropometric traits were included in the KoGES data: body mass index (BMI), grip strength, height, hip circumference, waist circumference, and body weight. We observed association signals for all of them except for waist circumference. Notably, the following traits showed more than one independent associational signal: height (5), body weight (4), hip circumference (3) and BMI (2). For 12 out of the 15 association signals for the anthropometric traits, we were able to identify previously reported associated loci within 500 kb from them (40–46).
Blood pressure QTL
Our tests identified both diastolic blood pressure (P = 2.87 × 10−11) and systolic blood pressure (P = 2.51 × 10−10) to be associated with an SNV (rs2071277) located within NOTCH4 gene. This result replicates previous studies that identified significantly associated loci within 500 kb from this SNV (47,48).
Hematological QTL
Eight hematological traits were included in the KoGES data: hematocrit, hemoglobin, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, mean corpuscular volume, platelet count red blood cell (RBC) count, and white blood cell (WBC) count. Seven out of the eight showed association signals, except for the mean corpuscular hemoglobin concentration. Notably, many traits showed more than one independent associational signal, including platelet (7), RBC count (4), WBC count (4), hematocrit (3) and mean corpuscular volume (2). Many of the identified genes turned out to be associated with multiple hematological traits: HLA-B (5), HLA-C (4), HMGA1 (3), BAK (2), HLA-A (2) and HLA-DRB1 (2). Our findings were consistent with previous reports detailing MHC associations with these hematological traits. For 20 out of the 22 association signals for the hematological traits, we were able to identify previously reported significantly associated loci within 500 kb from them (49–53).
Kidney-related QTL
We found both kidney-related quantitative traits in our analysis, blood urea nitrogen and creatinine, to be associated with variants in the MHC region. For blood urea nitrogen, the top association was rs3095337 (P = 4.12 × 10−12), which was located near HLA-E gene (280.3 kb). We were unable to find a study that previously reported this association. For creatinine, the top association was rs2301748 (P = 1.04 × 10−31) within MICA gene, and another independent association was rs34794906 (P = 5.27 × 10−12) within HLA-C gene. The two signals for creatinine replicated the result from a previous study, which found signals within 500 kb from them (52).
Liver-related QTL
We tested associations for three liver-related quantitative traits in the KoGES data: alanine aminotransferase (ALT), aspartate transaminase (AST) and γ-glutamyl transpeptidase. Association signals were found for ALT and AST. For ALT, the top association was rs2905715 (P = 6.92 × 10−11), which was located near MICA gene (16.0 kb). Another independent association for ALT was rs28673163 (P = 4.63 × 10−8), which was proximal to HMGA1 gene (25.9 kb). A previous study reported that an SNV (rs185355701) proximal to rs28673163 (317.9 kb) was significantly associated with ALT (P = 3.00 × 10−8) (52). For AST, the top association was rs117654833 (P = 2.92 × 10−14), which was near HLA-DRB1 gene (343.9 kb). A previous study reported that an SNV (rs115695709) proximal to rs117654833 (11.3 kb) was significantly associated with AST (P = 3.00 × 10−16) (52). Another independent association for AST was rs742870 (P = 6.02 × 10−9), which was proximal to HLA-DPB1 gene (32.5 kb).
Metabolic QTL
All four metabolic quantitative traits in the KoGES data, HDL cholesterol, total cholesterol, triglyceride and hemoglobin A1c, were found to be associated with variants in the MHC region. All of the association signals other than the association of an SNV (rs2923009) with hemoglobin A1c (P = 5.25 × 10−10) were previously known. The associations of triglyceride with an SNV (rs111524667, P = 1.08 × 10−46), which was near HLA-DQA1 (142.7 kb) (33) and another independent SNV (rs2797963, P = 1.17 × 10−13), which was near HMGA1 (54), replicated the results from previous studies. Total cholesterol was associated with HLA-DQB1 amino acid position 57 (P = 1.19 × 10−33), confirming the result from a previous article (30). All associations with HDL cholesterol we found were consistent with the results of earlier studies (30,52).
Protein QTL
We tested associations for four protein-related quantitative traits in the KoGES data: albumin, high-sensitivity C-reactive protein, protein in blood and total bilirubin. All of them were found to be associated with variants in the MHC region. For albumin, the top association was rs7762933 (P = 6.92 × 10−11), which was proximal to HLA-C gene (74.0 kb). The next two most significant independent associations were rs928482 and rs4713438, which were located within C6orf1 gene and POU5F1 gene, respectively. For high-sensitivity C-reactive protein, the top association signal was rs915654, which was proximal to MICB gene (75.8 kb). For protein in blood, the top association signal was an SNV (6: 32324081, GRCh37), which was located near C6orf10 (115.0 kb). For four out of the five association signals (except for rs928482) for the protein traits, we were able to identify previously reported significantly associated loci within 500 kb from them (52,55,56).
Other QTL
The KoGES data contained quantitative traits that were difficult to classify into any of the aforementioned categories. These traits were child delivery count, menarche onset age, menopause age, menopause cycle, pulse rate and personal well-being index. Among these six phenotypic traits, menopause age and pulse rate were associated with variants in the MHC region. We found menopause age to be associated with an SNV (6: 32547722, P = 3.08 × 10−9) within HLA-DRB1 gene. We also found pulse rate to be associated with an SNV (rs2534791, P = 3.10 × 10−8), which was in linkage disequilibrium (LD) with UBD gene.
Pleiotropic genes associated with multiple phenotypes
We counted how many phenotypes were associated with each gene, after assigning each independent association to a gene (Methods, Fig. 1). Notably, the gene that showed associations with the largest number of phenotypes (11) was HMGA1 gene, which is not an HLA gene. The associated traits were mostly anthropometric and hematological traits such as height (P = 2.05 × 10−76 at 6:34207474) (Fig. 2A) and RBC count (P = 6.61 × 10−11 at rs118012224). Some HLA class I genes such as the HLA-B and HLA-C genes also showed a large number of associations. HLA-C was associated with six traits such as WBC count that was associated with an SNV (rs17886232) within HLA-C gene (P = 1.57 × 10−56) (Fig. 2B). HLA-B was associated with six traits such as red blood cell count that was associated with HLA-B amino acid position 69 (P = 1.49 × 10−22). The HLA-DRB1 gene and HLA-DQA1 gene, which are HLA class II genes, showed associations with five and four phenotypes, respectively. Some examples are thyroid disease (P = 9.95 × 10−22 at 6:32608334 within HLA-DQA1 gene) (Fig. 2C) and menopause (P = 3.08 × 10−9 at 6: 32547722 within HLA-DRB1 gene). Non-classical HLA genes, such as MICA (4), MICB (2) and HLA-F (2) also showed multiple associations. For example, high-sensitivity C-reactive protein was associated with rs915654 near MICA gene (Fig. 2D). We note that the absolute number of associations of a gene should not be interpreted as the measure of pleiotropy, because some categories of traits, especially anthropometric traits, were overrepresented in our analysis. All of the Manhattan plots are accessible from the Online Supplementary Material.

Association result from PheWAS. Each Manhattan plot shows one example association pattern according to the type of associated gene. (A) Height was associated with HMGA1 gene, a non-HLA gene. (B) White blood cell count was associated with HLA-C gene, a HLA class I gene. (C) Thyroid disease was associated with HLA-DQA1 gene, a HLA class II gene. (D) High-sensitivity C-reactive protein was associated with MICB gene, a non-classical HLA gene. Dotted horizontal lines represent the genome-wide significance threshold. Genetic variants on classical HLA class I genes, classical HLA class II genes, non-classical HLA genes are colored red, blue and green, respectively.
Analysis of heritability and genetic correlation within MHC in the Korean population
To look at the overall effect of the MHC region on the phenotypes, we estimated the heritability explained by all variants in the MHC region (Fig. 3A). Although cervical cancer, tuberculosis and prostate cancer did not show any signal above genome-wide significance threshold, their estimated heritability ranked at the top among the phenotypes included in our analysis. This suggested that single-variant association may not be sufficient to fully assess the genetic risk of variants in the MHC region.

Heritability estimated in the entire MHC region and genetic correlation among phenotypes. (A) We estimated the heritability of phenotypes explained by the variants in the MHC region. Although cervical cancer, tuberculosis and prostate cancer did not have any association signals above the genome-wide significance threshold, their estimated heritability ranked at the top among the phenotypes included in our analysis. This indicates that single-variant associations may not fully assess the contribution of genetic risk in the MHC region. (B) Genetic correlations between phenotypes based on the variants in the MHC region are shown. We applied the Fruchterman–Reingold algorithm, a force-directed algorithm, with weights of z score of correlation. Accordingly, phenotypes that are genetically close to one another are clustered together. As expected, phenotypes belonging to the same category tended to cluster together.
In addition, to see how phenotypes are interrelated in terms of genetic variants in the MHC region, we estimated genetic correlations within MHC among phenotypes (Fig. 3B). For visualization, we applied the Fruchterman–Reingold algorithm, a force-directed algorithm, with weights of z-scores of correlations so that more significantly correlated phenotypes are clustered together. We found that phenotypes classified into the same category showed a tendency to cluster more closely.
Comparison of the HLA allele frequencies with the previously reported ones in the Korean population
We calculated the allele frequencies of HLA alleles in our imputed genotype dataset, and then we compared them to the most recently reported HLA allele frequencies in the Korean population (57) (Supplementary Material, Table S2). The reference data was obtained by applying amplicon-based next-generation sequencing (NGS) to 26 202 Korean individuals. In general, the allele frequencies from our dataset and the reference dataset were highly similar. The mean value of the absolute difference of allele frequencies was only 0.133% with a standard deviation of 0.255%.
Discussion
In this study, we performed MHC PheWAS in the Korean population using the KBA project data samples from KoGES cohorts. Prior to our study, Karnes et al. conducted MHC PheWAS for European individuals (4), and Hirata et al. applied MHC PheWAS to the BioBank Japan (BBJ) project data (12). To our knowledge, our study is the second MHC PheWAS study conducted for the East Asian population.
Through our study, we both confirmed the results from previous studies and discovered novel association patterns between HLA variants and phenotypes. When we checked the novelty of association signals based on their physical distance, we found that 56 out of the 69 independent association signals were the replication of previously identified association signals. The phenotypes for which we found novel association signals were hematuria (1), hip circumference (1), body weight (2), platelet count (1), WBC count (1), blood urea nitrogen (1), ALT (1), AST (1), hemoglobin A1c (1), albumin (1), menopause age (1) and pulse rate (1).
Our results had similarities and differences with the results of Hirata et al. When we compared the gene and phenotype association patterns with those shown by Hirata et al., we were also able to observe many similar association patterns. For example, we observed that six out of seven hematological traits with more than one significant association signal were associated with any of the HLA-B and HLC-C genes. This confirmed the abundant associations between the two genes and hematological traits reported in Hirata et al.’s study. Their study showed that 9 out of 12 hematological traits with more than one association signal were associated with any of the two genes. We found that for both studies, cervical cancer and hematological traits ranked at the top in MHC region-wide heritability. However, not all associations were similar. For example, the previously reported association of body mass index and HMGA1 was not observed in our case.
The difference in the results between our study and Hirata et al.’s may have been caused by several factors. First, we used a different reference panel from the panel Hirata et al. used. We used a merged HLA reference panel from the Chinese reference panel and the Pan-Asian panel, but Hirata et al. used population-specific NGS-based reference panel. In addition, the properties of the cohort may be different because KoGES used different cohort recruitment strategies from BBJ. The BBJ data is based on the medical records of nationwide hospitals, but KoGES consists of community dwellers and participants recruited from the national health examinee registry. In addition, the actual definitions of the phenotypes could have had subtle differences.
Our study has several limitations. First, we did not consider interactions in the MHC region. There can be two types of interactions in HLA: the non-additive effect of the two alleles of the same HLA gene, and gene–gene interactions between HLA genes or between HLA gene and non-HLA gene. Hu et al. showed that non-additive effects are prevalent in the MHC region in their analysis of type 1 diabetes (58). Although we did not perform interaction analysis for simplicity, future investigations may find interesting non-additive effects in MHC PheWAS. Second, we only analyzed the MHC region and did not consider other immune-related regions such as the KIR region. KIR is a similarly complex region, for which specialized imputation approaches are developed (59). Further studies will be needed to find pleiotropic associations in the KIR region and their interactions with MHC. Third, due to the complex LD structure of MHC region, we were not able to distinguish whether the observed association was a direct association or an indirect association caused by one or more nearby variants.
Our imputation consisted of two steps: HLA imputation using CookHLA (15), and SNV and indel imputation using the standard imputation pipeline. The drawback of this strategy was that a post-process was required to merge the two types of imputed variants after imputation. Ideally, it would be more convenient to impute both HLA and SNV using a single reference panel covering all variants. However, this single-step approach may degrade the HLA imputation accuracy because we cannot use CookHLA, as CookHLA currently only imputes HLA but not SNVs. CookHLA implements several ideas to maximize accuracy for HLA, and thus simply using the standard imputation pipeline for HLA reference markers may not be as accurate as CookHLA.
In sum, we presented the second largest MHC PheWAS in the East Asian population, which showed both consistent association patterns to previous studies and novel associations unique to the Korean population thus far. We expect that as more population-specific large-scale MHC PheWAS are conducted, the genetic architecture of MHC will be well dissected into population-shared and population-specific components, and the direct association common to all ethnicities will be better fine-mapped.
Materials and Methods
Cohort of data resource
We used as a data resource the KoGES (13), a large cohort study initiated by the Korea National Institute of Health (KNIH) of the Korea Disease Control and Prevention Agency (KDCA). In our analysis, we used three population-based cohorts included in KoGES: HEXA (Health Examinee) cohort (N = 99 159 (34 890 male/64 269 female), mean age = 53.3), CAVAS cohort (N = 18 907 (7085 male/11 822 female), mean age = 58.6) and community-based cohort (Ansung and Ansan) (N = 7607 (3672 male/3935 female), mean age = 52.1). The total number of samples was 125 673 (45 647 male/80 026 female). We note that, contrary to its name, CAVAS is a population-based cohort that is similar to the other two cohorts. That is, CAVAS was not designed to recruit participants with specific cardiovascular diseases. The three KoGES cohorts shared core questionnaires and examination items based on the standardized protocols (13). Examples of core variables include anthropometric, biochemical, clinical measurements, socio-demographic status, lifestyle-related questions and disease history. In addition to the core variables, each cohort collected cohort-specific variables including various ancillary measurements such as oral glucose tolerance test, carotid intima-media thickness test, pulmonary function test, bone mineral density, and oriental medicine study. In this study, only core variables were used for association analysis. The participants were questioned by trained interviewers regarding their socio-demographic status. The list of questionnaires can be downloaded from the KDCA website (http://www.kdca.go.kr/contents.es?mid=a40504010000).
Phenotype curation from KoGES dataset
Detailed methods for how we defined phenotypes using the KoGES questionnaire can be found in Supplementary Material, Table S1. We excluded phenotypes that were available for fewer than 10 000 individuals. In total, we constructed a dataset of 82 phenotypes (47 binary and 35 quantitative traits). For case/control traits, we classified individuals as cases if he or she reported as having suffered from the disease or having received treatment for the disease. We excluded males in the analysis of breast cancer and cervical cancer. Likewise, we excluded females in the analysis of prostate cancer. We excluded individuals affected by diabetes, hyperlipidemia, hypertension, allergic disease, colon polyps or rheumatoid arthritis from the control group since these diseases were known to be associated within the MHC region (12,60). For quantitative traits, we excluded individuals with extreme values outside the range of three standard deviations (>μ + 3σ or < μ-3σ) as they may be attributed to erroneous measurements during the investigation.
Genotype
Genotype data was obtained using the KBA (14), a microarray designed by KNIH. It contains tagging SNVs that maximize genomic coverage based on the Korean genomic structure. It comprises >833 000 markers, including >247 000 rare-frequency or functional variants found from >2500 sequencing data in Koreans. Genotypes were called by batches with about 3000–8000 samples with the consideration of the recruitment site. Samples were excluded if their gender was inconsistent, their call rate was low (<97%), excessive heterozygosity was observed, or they were one of the outliers based on principle component analysis results. After removing low-quality samples, we further excluded 2nd-degree relatives to contain only independent samples. We also excluded 199 samples who withdrew the consent to participate. Finally, we obtained data of 125 673 samples. We excluded low-quality SNVs as follows. We checked if they are poorly clustered in the SNPolisher analysis, their missing rates were >5% or their Hardy–Weinberg Equilibrium P-values were <1 × 10−6. Detailed QC criteria regarding genotype data from the array are described elsewhere (61). For imputing variants in the core MHC region (29–34 Mb) using HLA reference panel, we used 5292 SNVs genotyped on the KBA. For imputing variants in the extended MHC region (28–35 Mb) using a merged reference panel of 1000 Genomes project phase 3 and Korean Reference Genome (N = 397) (14), we used 8708 SNVs within the extended MHC region considering a 1 Mb buffer region (26–36 Mb) genotyped on the KBA.
HLA and SNV imputation of GWAS data
Imputation of GWAS data was separately done for HLA variants within the core MHC region (29–34 Mb) and other SNVS in the extended MHC region (28–35 Mb). For the core MHC region, we constructed a merged HLA reference panel (16) (N = 10 873) from the Chinese reference panel (17) (N = 10 386) and the Pan-Asian panel (18,19) (N = 487) after excluding samples whose information for both HLA alleles was not available. Then, we imputed variants using CookHLA (15) with this reference panel. Next, we used HATK (20) to define binary markers that correspond to the presence or absence of HLA alleles, amino acid residues at an amino acid position and alleles at an exonic multiallelic SNV in the MHC region. In addition, considering the strong LD structure within the MHC region, we phased haplotypes using Beagle (62) in order to obtain more accurate results for the conditional analysis. For the SNVs in the extended MHC region, we used IMPUTE4 (21) with the merged reference panel of 1000 Genomes project phase 3 and Korean Reference Genome (N = 397). Next, we used VCFtools (63) to convert the IMPUTE4 (21) output file of VCF format to a PLINK binary format (64,65). In the final step, we merged the two datasets. Before conducting PheWAS, we applied post-imputation QC filtering to markers. We filtered out variants of MAF < 1% using PLINK(64,65). For variants from the 1000 Genome reference panel, we applied additional stringent QC criteria (Hardy–Weinberg Equilibrium P-value ≥1.0 × 10−7 and imputation score ≥ 0.8). Finally, after QC filtering, 119 four-digit alleles, 349 amino acid polymorphisms of HLA genes and 70 284 SNVs and indels remained.
PheWAS of phenotypes and variants in the entire MHC region
Our regression model included age, sex, type of cohort and the top four principal components (PCs) as covariates. We included PCs to correct for potential population stratification in our dataset. We calculated PCs using FlashPCA2 (66) with about 40 K pruned subset of SNVs.
We applied a logistic regression model for testing binary traits and a linear regression model for quantitative traits by using generalized linear models implemented in the Python statsmodels package (67). For amino acid polymorphisms, which can be considered multiallelic markers, we additionally conducted an omnibus test to derive the significance of the overall effect of residues on the phenotypes. In an omnibus test, the P-value is obtained from a log likelihood-ratio test for the likelihood difference between the null model and the alternative model, which follows a χ2 distribution with n − 1 degrees of freedom (where n is the number of alleles). For other variants such as SNVs, indels and four-digit allele of HLA gene, we obtained P-values of the effect of the presence of the allele. We used a common GWAS threshold of P < 5 × 10−8 to determine the significance of associations.
To find further independent association signals for a phenotype with association signals satisfying the threshold, we conducted stepwise conditional regression analysis. In each step of the conditional analysis, to offset the effect of the variant with the highest association signal in the preceding step, we additionally included the variant as a covariate in our regression model. For variants on an HLA gene or variants in strong LD (r2 ≥ 0.7) with any polymorphism of the HLA gene, all of the four-digit alleles of the HLA genes were included as covariates to completely rule out the effect of the gene (12,68,69). The code for this association test is available at a separate GitHub repository (https://github.com/ch6845/GAT).
Gene assignment to the top association from PheWAS and novelty determination of significant loci
We assigned candidate genes to the most significantly associated variant by adopting criteria from a previous study (12). We mapped a variant to a gene using the following priorities: (1) if the variant is located on a gene, the gene was assigned to the variant; (2) if the variant is in moderate LD (r2 ≥ 0.5) with a variant on an HLA gene, the HLA gene was assigned to the variant; (3) if the variant is in strong LD (r2 ≥ 0.8) with a variant on a non-HLA gene, the non-HLA gene was assigned to the variant; (4) in other cases, the nearest gene from the variant was assigned to the variant. We utilized Ensembl gene database for obtaining the positions of genes (70,71).
To check for the novelty of our findings, we compared our results with previously reported associations through a thorough search of published literature. An association was regarded as ‘known’ if it was located within 500 kb from the variants previously reported to be associated with the identical trait. In our analysis, 33 traits showed 69 signals above significance threshold. After matching our result with the report of the same phenotypes, 56 signals were found to have been previously reported.
Heritability and genetic correlation analysis in the MHC region
We estimated the heritability accounted for by variants in the MHC region by using the Haseman–Elston regression in GCTA software (72), a momentum-based method, which has been used previously (12). This method is more computationally efficient than REML estimation for estimating heritability (73) and was doable for our large sample-size study. To convert the estimated heritability value of binary traits to liability-scale heritability, we needed disease prevalence data (74). We estimated disease prevalence from the case and control counts in the KoGES dataset as the study did not selectively recruit individuals affected by a specific disease.
We additionally estimated genetic correlation among phenotypes by conducting a bivariate analysis with two genomic relatedness matrices (GRMs). We used the Haseman–Elston regression implemented in GCTA software for this analysis. In this way, we obtained a 2D matrix where each cell represents genetic correlation of two phenotypes. We visualized the relationship among traits on a graph in which each node represents a phenotype and each edge represents a genetic correlation between traits. We only displayed the top 3% of edges sorted by the absolute values of z scores of r and excluded edges of |r| < 0.3. The nodes were positioned using the Fruchterman–Reingold algorithm implemented in the NetworkX Python package (75), by using the absolute value of z score of r as weights (75).
URL/Software
All of the GWAS summary statistics were publicly posted on the KBA project website (https://www.koreanchip.org/downloads).
KoGES http://nih.go.kr/contents.es?mid=a40504010000.
Full Analysis Pipeline https://github.com/ch6845/MHC_Korean_association.
HLA Analysis Toolkit (HATK) https://github.com/WansonChoi/HATK.
Plink https://www.cog-genomics.org/plink/.
Beagle https://faculty.washington.edu/browning/beagle/beagle.html.
VCFtools https://vcftools.github.io/.
NetworkX https://networkx.github.io/.
The NHGRI-EBI GWAS Catalog https://www.ebi.ac.uk/gwas/home.
Acknowledgement
Genotype data were provided by the Collaborative Genome Program for Fostering New Post-Genome Industry (3000–3031b).
Conflicts of Interest statement. Buhm Han is the CTO of Genealogy Inc. No other authors have any competing interests.
Funding
National Research Foundation of Korea (2019R1A2C2002608) funded by the Korean government, Ministry of Science, and ICT; the Creative-Pioneering Researchers Program funded by Seoul National University; the Korea National Institute of Health (2019-NG-054-02).
References
Klarin, D., Damrauer, S.M., Cho, K., Sun, Y.V., Teslovich, T.M., Honerlaw, J., Gagnon, D.R., DuVall, S.L., Li, J., Peloso, G.M et al. (
Ligthart, S., Vaez, A., Hsu, Y.H., Inflammation Working Group of the, C.C., Pmi Wg, X.C.P., LifeLines Cohort, S., Stolk, R., Uitterlinden, A.G., Hofman, A., Alizadeh, B.Z. et al. (
Shungin, D., Winkler, T.W., Croteau-Chonka, D.C., Ferreira, T., Locke, A.E., Magi, R., Strawbridge, R.J., Pers, T.H., Fischer, K., Justice, A.E. et al. (
Giri, A., Hellwege, J.N., Keaton, J.M., Park, J., Qiu, C., Warren, H.R., Torstenson, E.S., Kovesdy, C.P., Sun, Y.V., Wilson, O.D. et al. (
Author notes
Chanwoo Kim and Young Jin Kim authors contributed equally.