The major histocompatibility complex (MHC) containing the classical human leukocyte antigen (HLA) Class I and Class II genes is among the most polymorphic and diverse regions in the human genome. Despite the clinical importance of identifying the HLA types, very few databases jointly characterize densely genotyped single nucleotide polymorphisms (SNPs) and HLA alleles in the same samples. To date, the HapMap presents the only public resource that provides a SNP reference panel for predicting HLA alleles, constructed with four collections of individuals of north-western European, northern Han Chinese, cosmopolitan Japanese and Yoruba Nigerian ancestry. Owing to complex patterns of linkage disequilibrium in this region, it is unclear whether the HapMap reference panels can be appropriately utilized for other populations. Here, we describe a public resource for the Singapore Genome Variation Project with: (i) dense genotyping across ∼9000 SNPs in the MHC; (ii) four-digit HLA typing for eight Class I and Class II loci, in 96 southern Han Chinese, 89 Southeast Asian Malays and 83 Tamil Indians. This resource provides population estimates of the frequencies of HLA alleles at these eight loci in the three population groups, particularly for HLA-DPA1 and HLA-DPB1 that were not assayed in HapMap. Comparing between population-specific reference panels and a cosmopolitan panel created from all four HapMap populations, we demonstrate that more accurate imputation is obtained with population-specific panels than with the cosmopolitan panel, especially for the Malays and Indians but even when imputing between northern and southern Han Chinese. As with SNP imputation, common HLA alleles were imputed with greater accuracy than low-frequency variants.
The major histocompatibility complex (MHC) is a region of ∼4 Mb on chromosome 6 of the human genome that spans over 160 genes, including the classical human leukocyte antigen (HLA) Class I and Class II genes. With a significantly higher single nucleotide polymorphism (SNP) density than most regions, the MHC is among the most polymorphic regions in the human genome (1), and exhibits considerable diversity between populations (Supplementary Material, Table S1). With a significant fraction of the HLA genes encoding proteins that are involved in the immune system and self versus non-self-autoimmune responses (2), the degree of HLA matching is an important predictor of transplant rejection, and genome-wide association studies have implicated the HLA genes in numerous infectious and autoimmune diseases (3–10). Several HLA alleles have also been found to be strongly associated with increased susceptibility to adverse reactions to particular drugs (11,12), such as the B*57:01 allele for abacavir hypersensitivity (13) and the B*15:02 allele for carbamazepine-induced Stevens–Johnson syndrome (14).
Despite the clinical importance of identifying and matching the HLA types, there are very few databases that are dedicated to characterizing the HLA alleles across multiple populations. This is partly due to the high costs of determining the HLA types, especially for high-resolution typing across multiple HLA loci that are necessary for bone marrow or stem cell transplants. The prospect of statistically inferring the HLA alleles from genotype data is promising and cost-effective, and several statistical methods have been developed to leverage on HLA reference panels built by jointly characterizing SNPs and HLA loci for the same set of samples (15–19) such as the International HapMap Project (HapMap) (20) and the MHC Working Group of the Type 1 Diabetes Genetics Consortium (21). In the second phase of the HapMap project, dense genotyping yielded a database of almost 2.4 million SNPs and HLA typing was performed across six HLA loci (-A, -B, -C, -DQA1, -DQB1, -DRB1) for 301 samples from four ancestry groups: (i) Utah residents with northern and western European ancestry; (ii) Han Chinese in Beijing, China; (iii) Japanese in Tokyo, Japan and (iv) Yoruba in Ibadan, Nigeria.
Statistical imputation of the HLA types from genetic data fundamentally relies on matching the patterns of linkage disequilibrium (LD) that are found in the target data with those present in the reference panel, and variations in the extent or patterns of such genetic correlation can confound this analysis (22). Several studies have reported complex patterns of LD across the MHC region that differ significantly between populations (23–26), with reports of differential evidence of positive selection at specific HLA loci even between closely related populations such as northern and southern Han Chinese (27). It is thus unlikely that the reference panels built with the four populations in the HapMap resource can be universally representative of other global populations.
Here, we introduce another HLA resource built from the Singapore Genome Variation Project (SGVP) (28), which genotyped 268 subjects from three population groups in Southeast Asia on the Illumina 1M and Affymetrix 6.0 microarrays, yielding a genotype database of ∼9000 SNPs in the 25–35 Mb region in the MHC. High-resolution sequence typing using sequence-based typing and taxonomy-based sequence analysis was performed across three loci in Class I (-A, -B, -C) and five loci in Class II (-DPA1, -DPB1, -DQA1, -DQB1, -DRB1), respectively, to produce four-digit HLA alleles for each of the 268 subjects, which permitted the construction of three additional HLA reference panels from the 96 southern Han Chinese, 89 Southeast Asian Malays and 83 Tamil Indians from Singapore. Using an independent set of samples from the three population groups, we compared the accuracy of imputing the HLA types using the HapMap and the SGVP samples. In particular, we combined all four HapMap panels into one cosmopolitan reference panel and assessed how well such a cosmopolitan panel performs in imputing the HLA alleles for populations such as the Malays and the Indians that are considerably different from the HapMap populations. We also evaluated the accuracy of the imputation in the southern Han validation samples when imputed with either the HapMap northern Chinese reference panel or the SGVP southern Chinese panel. We expect our findings will provide a benchmark for deciding on the appropriate HLA reference panels to impute against, especially for Asian populations with ancestry similar to those of southern Chinese, Malays and Tamil Indians. The SGVP HLA resource is made publicly available at http://www.statgen.nus.edu.sg/~SGVP/hla.html.
The SGVP created a SNP resource for 96 southern Han Chinese (CHS), 89 Southeast Asian Malays (MAS) and 83 Tamil Indians (INS) from Singapore by genotyping each sample on the Illumina 1M and Affymetrix 6.0 microarrays, producing at least 1.5 million SNPs for each population group. This included ∼9000 SNPs that were present in all three populations in the MHC region between 25 and 35 Mb on chromosome 6. HLA typing was performed in all 268 subjects to yield at least four-digit resolution HLA alleles for HLA-A, HLA-B and HLA-C in Class I, and HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1 and HLA-DRB1 in Class II. By phasing the SNP genotype data together with the HLA alleles in each population separately with BEAGLE, three population-specific haplotype reference panels were created for the purpose of imputing HLA alleles at these eight loci. We similarly constructed four reference panels for the populations in Phase 2 of the HapMap by integrating at least 16 000 SNPs in the MHC with six HLA loci (-A, -B, -C, -DQA1, -DQB1, -DRB1) in the Caucasian (CEU), Nigerian African (YRI), Han Chinese (CHB) and Japanese (JPT) samples. A cosmopolitan reference panel was constructed by combining all four population-specific panels for HapMap.
Comparing the HLA alleles between SGVP and HapMap
We observed that HLA-B was the most polymorphic of the eight HLA loci, with the number of alleles ranging from 30 in CHS to 40 in MAS. This concurred with the findings from HapMap that HLA-B was the most polymorphic locus, with 28 alleles in CHB and 56 alleles in the combined cosmopolitan panel. The number of HLA alleles present across the eight HLA loci was similar between the CHS and MAS samples, while the INS samples exhibited more alleles at all loci except DPA1 and DRB1 (Table 1). However, there were significant differences in the distributions of the alleles within each locus (Fig. 1 and Supplementary Material Fig. S1 and Table S2). For example, while A*11:01 was the most frequently occurring allele in HLA-A in all three SGVP populations, different alleles were found to be most common in HLA-B, with B*40:01 in CHS at a frequency of 20.4%, B*15:02 in MAS at 15.5% and B*40:06 in INS at 12.5%. In fact, other than DQB1, where DQB1*06:01 was most common in all three populations, different alleles were observed to be most frequent in the three populations for the remaining five HLA loci. The number of alleles across HLA-A, HLA-B, HLA-C and HLA-DQB1 was consistent between the SGVP and HapMap populations, although fewer alleles were reported in the SGVP populations for HLA-DQA1 and HLA-DRB1 due to the greater extent of missing or ambiguous HLA data in the SGVP samples (Table 1). No comparison could be made for DPA1 and DPB1 as HapMap did not survey these two loci.
|No. of samples||96||89||83||54||301||20||20||20|
|No. of SNPs||10 391||10 381||10 869||16 012||18 015||26 376||26 376||26 376|
|HLA locus||Number of distinct four-digit HLA alleles|
(a) Refers to a cosmopolitan panel obtained by combining the four population-specific HLA reference panels for CEU, CHB, JPT and YRI.
(b) No information is available for the HapMap samples as these two loci were not surveyed.
(c) Only 47 CHS, 66 MAS and 88 INS chromosomes had unambiguous four-digit HLA alleles for this locus.
(d) Only 107 CHS, 91 MAS and 103 INS chromosomes had unambiguous four-digit HLA alleles for this locus.
(e) Only 20 CHS, 27 MAS and 31 INS chromosomes had unambiguous four-digit HLA alleles for this locus.
We investigated the diversity of the Class I HLA genes across the three SGVP populations (CHS, MAS and INS) and three HapMap populations (CEU, CHB and JPT) by calculating the multi-allelic FST between every pair of populations. The choice of the populations was motivated by the interest in comparing: (i) the southern Chinese (CHS) with the northern Chinese (CHB) and another East Asian population (JPT); (ii) the Europeans (CEU) with the Tamil Indians (INS), given previous reports that the Indians were genetically closer to the Europeans than to the Chinese and Malays. The results of the FST estimation at each Class I gene were illustrated with a population dendrogram obtained through agglomerative hierarchical clustering (Fig. 2). While the relationships between the six populations were broadly consistent across all three loci, subtle variations existed that may offer insight into the representativeness, or lack thereof, of the HapMap HLA data for the SGVP populations. For example, while the two Han Chinese cohorts were the most homogeneous of the six populations for HLA-B and HLA-C (FST = 0.19 and 0.59%, respectively, Supplementary Material, Table S3), the southern Chinese (CHS) were found to be closer to the Malays (MAS, FST = 0.69%) than to the northern Chinese (FST = 1.22%) at HLA-A. The Tamil Indians (INS) were generally found to be closer to the Europeans (CEU) than the Chinese and Malays were (FST between CEU and INS for HLA-A = 7.35%, HLA-B = 5.31%, HLA-C = 5.48%, compared with FST between CEU and CHS for HLA-A = 13.03%, HLA-B = 7.49%, HLA-C = 6.97%), although the magnitude of the FST estimates suggested that considerable diversity likely existed in the HLA profiles of these Class I genes between Europeans and Indians (Supplementary Material, Table S3).
Tagging HLA alleles with SNPs
Joint phasing of the HLA alleles and the SNPs allowed the assessment of the correlation between particular combinations of SNP variants with the HLA alleles, thus quantifying the ability of particular sets of SNPs to tag the HLA alleles. In general, the HLA alleles were tagged by SNPs that were located in the vicinity of the respective genes (Fig. 3), with 46.8% (CHS), 48.3% (MAS) and 56.5% (INS) of the HLA alleles exhibiting a correlation of r2 ≥ 0.80 with up to three SNPs (Supplementary Material, Table S4). The SNPs identified exhibited different tagging abilities in terms of: (i) the extent of correlation between the same SNP identified to be tagging a particular HLA allele in all three populations; (ii) the set of SNPs identified to be associated with the HLA allele in each population. An example of the former is the allele HLA-B*57:01, which was perfectly tagged by the guanine allele of rs2395029 (r2 = 1) in CHS and INS, but the same SNP only had a tagging efficiency of r2 = 0.746 in MAS and required an additional SNP rs4418214 for the guanine–cytosine haplotype of the two SNPs to achieve perfect tagging; an example of the latter is allele HLA-B*44:03, which was tagged (with r2 ≥ 0.80) by seven SNPs in CHS but was instead tagged by an entirely discordant set of 27 SNPs in MAS.
Predicting HLA types in independent sample sets
To assess the performance of the SGVP reference panels in imputing the HLA alleles, we performed a validation experiment with 120 samples from the Singapore Integrative Omics Study (iOmics), where 40 individuals from each of the same three population groups in the SGVP were (i) genotyped on the Illumina HumanOmni2.5 and Illumina Exome microarrays and (ii) assayed for the same eight HLA loci in Class I and Class II. This presented a denser panel of 23 451 SNPs across the 25–35 Mb region of chromosome 6, which we used as input to predict the alleles at the eight HLA loci for the 120 samples using the population-specific and cosmopolitan reference panels that we constructed. The predicted four-digit HLA alleles were then compared against the experimentally determined four-digit HLA types to evaluate the error rates of the imputation. In silico HLA alleles were similarly predicted using the HapMap reference panels, where we considered the CHB and cosmopolitan panels for imputing the 40 southern Chinese samples, and only the cosmopolitan panel for imputing the Malay and Indian samples.
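The comparison of predicted against experimentally determined alleles can be sketched as follows. This is a minimal illustration, not the scoring code used in the study: it assumes the error rate is counted per allele (two per individual), that alleles are compared as unordered pairs, and that individuals with a missing typed allele are skipped. The function name and the toy genotypes are hypothetical.

```python
from collections import Counter


def allele_error_rate(typed, imputed):
    """Fraction of typed alleles that the imputation failed to recover.

    typed, imputed: lists of (allele1, allele2) tuples, one per individual,
    in the same order. Pairs are compared without regard to order.
    Individuals with a missing typed allele (None) are skipped (assumption).
    """
    mismatches = 0
    total = 0
    for t_pair, i_pair in zip(typed, imputed):
        if None in t_pair:
            continue
        # Multiset intersection: how many of the two typed alleles were recovered.
        matched = sum((Counter(t_pair) & Counter(i_pair)).values())
        mismatches += 2 - matched
        total += 2
    return mismatches / total if total else float("nan")


# Toy HLA-B genotypes for three individuals (illustrative values only).
typed = [("B*40:01", "B*15:02"), ("B*40:01", "B*40:01"), ("B*57:01", "B*15:02")]
imputed = [("B*15:02", "B*40:01"), ("B*40:01", "B*46:01"), ("B*57:01", "B*15:02")]

# One of the six typed alleles is not recovered.
print(allele_error_rate(typed, imputed))
```

Counting per allele rather than per individual gives partial credit when one of the two alleles is imputed correctly; a per-individual definition would be stricter.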
For the southern Chinese samples, the HapMap cosmopolitan panel achieved four-digit imputation accuracy similar to or better than that of the population-specific CHS panel for six loci (-A, -B, -C, -DQA1, -DQB1, -DRB1) (Table 2). No comparisons can be made for DPA1 and DPB1 as HapMap did not assay these two Class II loci. Comparing between the SGVP and HapMap cosmopolitan panels, the SGVP panel produced more accurate imputation for all three Class I loci, especially at HLA-B, where the error rate decreased from 35% with the HapMap panel to 23% with the SGVP panel. As expected, the best imputation performance was observed with the Grand cosmopolitan panel that combined both the SGVP and HapMap cosmopolitan panels. The northern Chinese panel (CHB) was inferior to the southern Chinese panel (CHS) for all three Class I loci but outperformed the CHS panel for the three Class II loci, although this was likely an effect of the small number of samples with available data to construct the SGVP panels at -DQA1, -DQB1 and -DRB1 (see Table 1). For the Malays, the HapMap cosmopolitan panel was clearly inappropriate, with error rates of 55, 65 and 14% for HLA-A, HLA-B and HLA-C, respectively, compared with the much lower error rates of 20, 34 and 4% when imputed using the MAS panel (Table 2). The SGVP cosmopolitan panel produced even better performance with error rates of 11, 28 and 4%, respectively, although surprisingly the Grand cosmopolitan panel produced higher error rates than the SGVP panel. The population-specific INS panel yielded better performance than the HapMap cosmopolitan panel for the polymorphic HLA-B (25 versus 48%, respectively), although the HapMap cosmopolitan panel had lower error rates for both HLA-A and HLA-C (11 and 16% for HapMap, 18 and 29% for INS, respectively). The SGVP panel offered the lowest error rates at both HLA-A and HLA-B, although the Grand cosmopolitan panel performed the best for HLA-C.
The HapMap or Grand cosmopolitan panels consistently performed better than the population-specific or SGVP panels for the three Class II loci, since the number of alleles in the SGVP populations for these loci was considerably smaller than the number of alleles in the target iOmics data set due to poor assays for most of the samples at these three loci.
|HLA locus||iOmics southern Chinese||iOmics Malays||iOmics Indians|
The error rate corresponding to the reference panel(s) with the most accurate imputation for each HLA locus in each population is highlighted in bold. The error rates were calculated using 40 iOmics samples in each of the three population groups.
(a) SGVP refers to the combined panel constructed by merging the three SGVP panels for Singapore Chinese (CHS), Singapore Malay (MAS) and Singapore Indian (INS).
(b) Cosmo refers to the combined panel constructed by merging the four HapMap panels for Europeans (CEU), northern Chinese (CHB), Japanese (JPT) and Nigerian Africans (YRI).
(c) Grand refers to the combined panel constructed by merging the SGVP (a) and Cosmo (b) panels.
(d) HapMap did not assay this locus.
Imputation accuracy by HLA allele frequency
To understand how imputation accuracy varies with the frequency of the HLA alleles in the populations, we binned all the HLA alleles observed across the eight loci in the iOmics samples into two categories: (i) low-frequency, defined as an HLA allele frequency of between 0 and 10% and (ii) common, defined as an HLA allele frequency of >10%. Regardless of the choice of the reference panel, common HLA alleles were imputed with greater accuracy than low-frequency variants, although the accuracy was significantly higher when the population-specific panels were used instead of the HapMap cosmopolitan panel (Fig. 4). The gain in accuracy was greater for common alleles than for low-frequency alleles, and for the Malay and Tamil Indian samples, which were considerably distinct from the makeup of the HapMap cosmopolitan panel.
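The binning described above can be sketched as follows. This is an illustrative reading rather than the analysis code: it assumes allele frequencies are estimated from the typed validation alleles themselves, that an allele at exactly 10% falls in the low-frequency bin, and that accuracy is the proportion of typed alleles recovered in the imputed pair. The function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def accuracy_by_frequency_bin(typed, imputed, threshold=0.10):
    """Per-bin imputation accuracy for one locus.

    typed, imputed: lists of (allele1, allele2) tuples, one per individual.
    An allele is 'common' if its frequency among typed alleles exceeds
    `threshold`, otherwise 'low-frequency'.
    """
    all_alleles = [a for pair in typed for a in pair]
    counts = Counter(all_alleles)
    n = len(all_alleles)
    correct = defaultdict(int)
    total = defaultdict(int)
    for t_pair, i_pair in zip(typed, imputed):
        remaining = Counter(i_pair)  # imputed alleles not yet matched
        for allele in t_pair:
            bin_name = "common" if counts[allele] / n > threshold else "low-frequency"
            total[bin_name] += 1
            if remaining[allele] > 0:
                remaining[allele] -= 1
                correct[bin_name] += 1
    return {b: correct[b] / total[b] for b in total}


# Toy HLA-A data for five individuals (illustrative values only).
typed = [
    ("A*11:01", "A*11:01"),
    ("A*11:01", "A*24:02"),
    ("A*11:01", "A*02:07"),
    ("A*11:01", "A*11:01"),
    ("A*11:01", "A*33:03"),
]
imputed = [
    ("A*11:01", "A*11:01"),
    ("A*11:01", "A*24:02"),
    ("A*11:01", "A*11:01"),
    ("A*11:01", "A*11:01"),
    ("A*11:01", "A*11:01"),
]

print(accuracy_by_frequency_bin(typed, imputed))
```

In this toy example the imputation falls back on the common allele for two of the three rare alleles, which mirrors the pattern the text reports: errors concentrate in the low-frequency bin.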
We have introduced a genomic resource that integrates data from dense SNP genotyping with high-resolution four-digit HLA allelotyping at three HLA Class I loci and five Class II loci, for three populations of southern Han Chinese, Southeast Asian Malay and Tamil Indian origins. It was developed with SNPs present on the Illumina 1M and Affymetrix 6.0 microarrays, and has enabled three population-specific reference panels to be constructed for the purpose of imputing HLA types with SNP data, especially for the Malays and Tamil Indians, for which there are currently no suitable reference panels. However, only the reference panels for the three Class I loci (-A, -B, -C) and two loci in Class II (-DPA1, -DPB1) are suitable for use, as poor HLA sequencing results meant that there were significantly higher levels of missing allele calls for -DQA1, -DQB1 and -DRB1. When we compared the performance of these population-specific panels against that of a cosmopolitan reference panel built from combining all four HapMap Phase 2 populations, they consistently delivered more accurate imputation in an independent set of samples from the three population groups. However, the best imputation performance was obtained with a grand cosmopolitan panel obtained from using all available samples from the SGVP and HapMap. As with SNP imputation, HLA alleles that are more common in the population are imputed with greater accuracy, whereas low-frequency alleles are more likely to be wrongly imputed regardless of the choice of reference panel.
Numerous studies evaluating the use of SNP imputation in diverse populations have indicated that: (i) increasing the size of the reference panel improved SNP imputation accuracy, especially for non-African target populations (29–33) and (ii) a cosmopolitan reference panel can often be relevant for imputing SNPs in populations not present in the makeup of the cosmopolitan panel. The former is especially true for East Asian populations such as the Han Chinese and Japanese, where across the genome the set of distinct haplotypes is usually smaller due to the relative homogeneity between these populations. Therefore, one naturally expects that the CHB samples from HapMap may serve as an appropriate reference panel to impute the HLA types for other Han Chinese samples. However, the southern Han Chinese samples experienced higher error rates when imputed against the northern Han Chinese, or even with the cosmopolitan panel in the case of HLA-B. Also, the HapMap and the Grand cosmopolitan panels were clearly inappropriate to infer the HLA types for the new populations of the Malays and Tamil Indians, despite the inclusion of the SGVP data in the Grand cosmopolitan panel. In addition, the iOmics samples used to evaluate the reference panels had >23 000 SNPs between 25 and 35 Mb, while the HapMap reference panel was built using ∼13 000 SNPs compared with the SGVP panel of ∼9000 SNPs. Despite these two disadvantages of the SGVP panel (fewer SNPs and a smaller sample size per population-specific panel), the population-specificity of the SGVP panels still delivered better imputation performance, and this indicates that having the appropriate reference panel is particularly important in HLA imputation.
Several HLA alleles have been demonstrated to be highly predictive of adverse drug reactions (ADRs), particularly those related to immunoallergic reactions. From a public health perspective, understanding the burden of predictable ADRs in the population begins with measuring how common the predicting triggers are in a target group, which for HLA-associated ADRs will be the frequencies of the particular HLA alleles in the population (Table 3). By knowing how common the associated HLA alleles are in the population, as well as the positive and negative predictive values for the presence of the specific HLA alleles (11), health economic modeling can be undertaken to evaluate the cost-effectiveness of prospective HLA screening. For example, information from this resource on the frequencies of the HLA-B*15:02 allele in the three population groups in Singapore was used in a cost-effectiveness analysis of a public health policy around genomic screening to prevent Stevens–Johnson syndrome induced by the anti-epileptic drugs carbamazepine and phenytoin. The economic evaluation revealed that screening was cost-effective to the healthcare system for the Chinese and Malays, but not for the Tamil Indians, as the HLA-B*15:02 allele was present at a frequency of only 1.9% in the Indians, compared with 9.1 and 15.5% in the Chinese and Malays, respectively (34).
|HLA allele||Drug||Adverse reaction (a)||Allele frequency (%)||Predictive value (b) (%)|
(a) DILI, drug-induced liver injury; HSS, hypersensitivity syndrome; SJS, Stevens–Johnson syndrome; TEN, toxic epidermal necrolysis.
(b) Predictive values obtained from Becquemont (11).
Our study has augmented the SNP resource of the SGVP with high-resolution HLA data for three additional populations in Asia. To date, this is the first comprehensive survey of HLA Class I and Class II loci in an Austronesian group (Malay) and a South Asian population. Given the multi-allelic and diverse nature of the MHC, we do not anticipate that the imputation reference panels that we have introduced will be applicable to other Austronesian and South Asian populations that are not Malays or Tamil Indians, respectively. Indeed, we caution against the use of these reference panels for populations other than those defined to be southern Han Chinese, Southeast Asian Malays or Tamil Indians. We envisage our panels will be a timely complement to those from the HapMap, as the genetics community begins to evaluate the biological significance of disease association findings located within the HLA regions, which will require progressing from SNP association to association with classical HLA alleles, and finally to identify the specific amino acid changes involved (35,36).
MATERIALS AND METHODS
Sample collection, genotyping and HLA typing
The SGVP surveyed 100 individuals from each of southern Han Chinese, Southeast Asian Malay and Tamil Indian ancestries. The subjects were randomly and anonymously chosen from a study on inter-population variation in drug response. Self-reported gender and population membership are available for each of the 300 samples, with the latter determined through self-reports that all four grandparents belong to the same population group. Out of the 300 samples identified for SGVP, 99 Chinese, 98 Malays and 95 Indians were genotyped on both the Illumina HumanHap1M and Affymetrix SNP 6.0 microarrays, of which 96 Chinese (CHS), 89 Malays (MAS) and 83 Indians (INS) across, respectively, 1 584 040, 1 580 905 and 1 583 454 autosomal SNPs were retained for further analyses after quality assessments. A set of 9766 SNPs that were present in all three populations in the 25–35 Mb region of the MHC was extracted to develop the HLA reference panels in this study. Details of the SNP-level quality checks are available in the original SGVP publication (28). High-resolution sequence-based HLA typing was performed for the three Class I loci (-A, -B, -C) and five Class II loci (-DPA1, -DPB1, -DQA1, -DQB1, -DRB1) using a sequence-based typing method with taxonomy-based sequence analysis, with a target resolution of at least four digits (37,38).
Ethical consents for the SNP and HLA typing were obtained from the Institutional Review Board of the National University of Singapore, and informed consent was obtained from all participants for the inter-population study on genetic variability to drug response.
Data from the International HapMap Project
Phase 2 of the International HapMap Project surveyed ∼3.1 million SNPs in: (i) 30 parent-offspring trios of northern and western European ancestry from the Centre d'Etude du Polymorphisme Humain collection (CEU); (ii) 30 parent-offspring trios of the Yoruba people from Ibadan, Nigeria (YRI); (iii) 45 unrelated Han Chinese from Beijing, China (CHB) and (iv) 45 unrelated Japanese from Tokyo, Japan (JPT) (20). The MHC data set for six HLA loci (-A, -B, -C, -DQA1, -DQB1, -DRB1) in 301 of these samples was obtained from the HapMap resource and 18 015 SNPs were available in the 25–35 Mb region of the MHC for the four population groups. The HapMap HLA resource is available at http://www.inflammgen.org.
Imputation of classical HLA alleles
To build the HLA reference panels for the populations in SGVP and HapMap, SNPs located between 25 and 35 Mb of chromosome 6 were considered, although we excluded SNPs with minor allele frequency <1%, missing genotype rate >5%, or Hardy–Weinberg equilibrium P-value <10⁻⁶. All SNP allele annotations were mapped to the forward strand, and physical coordinates corresponding to hg18 were used in this study. Each of the unique four-digit HLA alleles was encoded as a biallelic marker in each subject, with the entry effectively counting the number of copies of that allele present (0, 1 or 2). Any HLA allele with frequency <0.01 was excluded and such alleles were encoded as missing in the samples. The HLA alleles at each locus were assigned a genetic position that corresponded to the center of the respective gene, except when the position coincided with the location of an actual SNP, in which case the genetic position of the HLA alleles was shifted by one base. Phasing was performed with BEAGLE (36,39) on the SNP and HLA data to generate the reference panels for the populations in HapMap and SGVP. In addition, we combined the four population-specific HLA reference panels from HapMap to generate a HapMap cosmopolitan reference panel and combined the three population-specific HLA reference panels from SGVP to generate an SGVP cosmopolitan reference panel, and merged these two cosmopolitan panels to form a combined (Grand) cosmopolitan reference panel. In imputing the HLA alleles for a target data set possessing a similar set (or a subset) of the SNPs present in the reference panel, the binary markers at the HLA loci were encoded as missing values and BEAGLE was subsequently used to impute the outcome at these markers. This imputation process produces posterior probabilities for each possible HLA allele as well as the best-guess allele for each individual, and the posterior probabilities can be used to calculate the dosages for the respective HLA alleles.
Default parameter settings for BEAGLE were used in the phasing and imputation except we assumed a larger maximum window size of 1000 consecutive markers for building the haplotype frequency model. These steps are embedded in the SNP2HLA tool (36), and the details of commands used can be found in the Supplementary Material.
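The allele-count encoding and the dosage calculation described above can be sketched as follows. This is a simplified illustration and not the SNP2HLA implementation: it assumes that a sample carrying an untyped or excluded (rare) allele is set to missing for every marker at that locus, and that the dosage is the expected allele count from genotype posterior probabilities. The function names and toy genotypes are hypothetical.

```python
from collections import Counter


def encode_hla_markers(genotypes, min_freq=0.01):
    """Encode one HLA locus as per-allele count markers (0, 1 or 2).

    genotypes: dict mapping sample ID -> (allele1, allele2); None = untyped.
    Returns a dict mapping each retained allele -> {sample ID: count or None}.
    Alleles with frequency below `min_freq` are dropped, and samples carrying
    a dropped or untyped allele are encoded as missing (None) -- an assumption.
    """
    typed = [a for pair in genotypes.values() for a in pair if a is not None]
    counts = Counter(typed)
    kept = {a for a, c in counts.items() if c / len(typed) >= min_freq}
    markers = {allele: {} for allele in sorted(kept)}
    for sample, pair in genotypes.items():
        usable = all(a in kept for a in pair)
        for allele in markers:
            markers[allele][sample] = pair.count(allele) if usable else None
    return markers


def expected_dosage(p_het, p_hom):
    """Expected copies of an allele given posterior P(1 copy) and P(2 copies)."""
    return p_het + 2.0 * p_hom


# Toy HLA-B genotypes (illustrative values only).
genotypes = {
    "s1": ("B*40:01", "B*15:02"),
    "s2": ("B*40:01", "B*40:01"),
    "s3": ("B*15:02", None),
}
markers = encode_hla_markers(genotypes)
print(markers["B*40:01"])          # per-sample copy counts for B*40:01
print(expected_dosage(0.25, 0.5))  # 0.25 * 1 + 0.5 * 2
```

Encoding each allele as its own biallelic marker is what lets a SNP-oriented phasing and imputation tool treat a multi-allelic HLA locus as a set of ordinary binary markers.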
LD calculation and tagSNP identification
The pairwise LD between each of the HLA alleles and the surrounding SNPs was calculated using Haploview (40) applied to the phased haplotypes with the binary-encoded HLA alleles, and the extent of LD was quantified using the genetic correlation coefficient r2. To identify the tagging SNPs for each of the HLA alleles, we first identified the set of HLA alleles that were in perfect LD with single SNPs and then performed an aggressive search for tagging haplotypes consisting of either two or three SNPs for the remaining HLA alleles with the ‘Tagger’ algorithm implemented in Haploview. If none of the one- to three-marker haplotype combinations provided an r2 ≥ 0.8, the specific combination that yielded the highest r2 was reported.
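For intuition, the r2 between a binary-encoded HLA allele and a SNP allele on phased haplotypes can be computed directly, as in the sketch below. It uses the standard definition r2 = D² / (pA(1 − pA) pB(1 − pB)) and is not a reimplementation of Haploview or Tagger; the function name and the toy haplotypes are hypothetical. A multi-SNP tag can be scored the same way by first collapsing the SNPs into a single haplotype indicator.

```python
def r_squared(hap_a, hap_b):
    """Squared correlation (r2) between two binary markers on phased haplotypes.

    hap_a, hap_b: equal-length lists of 0/1, one entry per haplotype
    (for example, presence of an HLA allele and presence of a SNP allele).
    Returns NaN if either marker is monomorphic.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    return d * d / denom


# Eight toy haplotypes (illustrative values only).
hla = [1, 1, 0, 0, 0, 0, 0, 0]    # haplotypes carrying the HLA allele
snp1 = [1, 1, 1, 0, 0, 0, 0, 0]   # tag allele at a first SNP
snp2 = [1, 1, 0, 1, 0, 0, 0, 0]   # tag allele at a second SNP

# Two-SNP haplotype indicator: both tag alleles on the same haplotype.
two_snp_tag = [a & b for a, b in zip(snp1, snp2)]

print(r_squared(hla, snp1))         # imperfect single-SNP tag
print(r_squared(hla, two_snp_tag))  # the two-SNP haplotype tags perfectly
```

The toy data reproduce the qualitative situation described for HLA-B*57:01 in MAS, where a single SNP tags imperfectly but a two-SNP haplotype achieves r2 = 1.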
Multi-allelic FST calculation and population dendrogram construction
We quantified the genetic distance between two populations at each of the three HLA loci in Class I by calculating the multi-allelic FST, which is a weighted-average implementation of the standard biallelic-SNP FST metric defined by Weir and Cockerham (41) with weights defined by the pooled frequencies of the alleles across the two populations. The FST calculations were performed using Arlequin (42) for every pair of the following six populations: CHS, MAS and INS from SGVP, and CEU, CHB and JPT from HapMap. This yielded a 6 × 6 distance matrix for each of the three HLA loci in Class I (-A, -B, -C), which is subsequently used as the input to perform an agglomerative hierarchical clustering of the six populations using the agnes command in R. Briefly, each population is assigned to a separate cluster and the algorithm iterates between merging two clusters with the smallest between-cluster dissimilarity and updating the dissimilarity between the newly formed cluster with all remaining clusters. A dendrogram clustering tree is subsequently constructed to display the results of the agglomerative clustering, where the vertical coordinate where two branches join provides a measure of the dissimilarity between the two corresponding clusters.
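The clustering procedure described above can be sketched in a few lines. This is an illustration, not the R code used in the study: it assumes average linkage (which is, to our understanding, the default method of agnes), and the population labels P1 to P4 and their pairwise dissimilarities are made-up values rather than the FST estimates reported here.

```python
def average_linkage(labels, dist):
    """Agglomerative hierarchical clustering with average linkage.

    labels: list of population names.
    dist: dict mapping frozenset({a, b}) -> dissimilarity (e.g. pairwise FST).
    Returns the merges in order as (cluster1, cluster2, height) tuples, where
    height is the between-cluster dissimilarity at which the two were joined.
    """
    def between(c1, c2):
        # Average of all pairwise dissimilarities between members of c1 and c2.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(lab,) for lab in labels]  # start: one cluster per population
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest dissimilarity.
        height, i, j = min(
            (between(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical pairwise dissimilarities (in %) between four populations.
labels = ["P1", "P2", "P3", "P4"]
dist = {
    frozenset(("P1", "P2")): 1.0,
    frozenset(("P1", "P3")): 4.0,
    frozenset(("P2", "P3")): 5.0,
    frozenset(("P1", "P4")): 10.0,
    frozenset(("P2", "P4")): 11.0,
    frozenset(("P3", "P4")): 12.0,
}

for left, right, height in average_linkage(labels, dist):
    print(left, right, height)
```

The merge heights returned here correspond to the vertical coordinates at which branches join in the dendrogram.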
Validation samples from the Singapore Integrative Omics Study
To assess the transferability of the SGVP and HapMap reference panels for HLA allele prediction, we identified an independent collection of 120 samples from the Singapore Integrative Omics Study (iOmics), consisting of 40 southern Han Chinese, 40 Southeast Asian Malays and 40 Tamil Indians from Singapore. As with the SGVP samples, population membership was determined by ascertaining that all four grandparents were self-reported to belong to the same population group. Each of these samples was genotyped on both the Illumina HumanOmni2.5 and HumanExome microarrays, which provided a total of 23 451 SNPs across the 25–35 Mb region of chromosome 6. Each sample was also typed for the same eight HLA loci as the SGVP samples with the same methodology as described previously.
This project acknowledges the support of the Saw Swee Hock School of Public Health, the Yong Loo Lin School of Medicine, the National University Health System, the Life Science Institute and the Office of Deputy President (Research and Technology) from the National University of Singapore. N.E.P., R.T.H.O., W.T.P. and Y.Y.T. additionally acknowledge support from the National Research Foundation Singapore (NRF-RF-2010-05).
Conflict of Interest statement. None declared.
The URLs for data and software presented herein are as follows: International HapMap Project, http://hapmap.ncbi.nlm.nih.gov/; Singapore Genome Variation Project, http://www.statgen.nus.edu.sg/~SGVP; the HapMap HLA resource, http://www.inflammgen.org.