Development of a high-resolution NGS-based HLA-typing and analysis pipeline

The human leukocyte antigen (HLA) complex contains the most polymorphic genes in the human genome. The classical HLA class I and II genes define the specificity of adaptive immune responses. Genetic variation at the HLA genes is associated with susceptibility to autoimmune and infectious diseases and plays a major role in transplantation medicine and immunology. Currently, the HLA genes are characterized using Sanger- or next-generation sequencing (NGS) of a limited amplicon repertoire or labeled oligonucleotides for allele-specific sequences. High-quality NGS-based methods are in proprietary use and not publicly available. Here, we introduce the first highly automated open-kit/open-source HLA-typing method for NGS. The method employs in-solution targeted capturing of the classical class I (HLA-A, HLA-B, HLA-C) and class II HLA genes (HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1). The calling algorithm allows for highly confident allele-calling to three-field resolution (cDNA nucleotide variants). The method was validated on 357 commercially available DNA samples with known HLA alleles obtained by classical typing. Our results showed on average an accurate allele call rate of 0.99 in a fully automated manner, identifying also errors in the reference data. Finally, our method provides the flexibility to add further enrichment target regions.


Description of parameter selection for the automated calling algorithm
The first analyses of the HLA data were performed with the haploid cell lines described by Horton et al. (1). For each sample and locus we mapped all reads to the collection of available reference sequences and saw that the allele with the highest number of perfect alignable reads was the reference allele. For the evaluation of heterozygous samples we had to generate a valid sample set. Here we chose 20 samples for which we had Sanger-determined HLA allele data available (see Supplementary Methods Table 1 below). For these samples we performed the targeted enrichment with a bait design that also covered the Affymetrix SNP array 6.0 SNVs of the HLA region. Employing our pibase software (2) for SNP calling and HLA imputation (3), we calculated two field HLA types for that sample set (Supplementary Methods Table 1). These twenty samples and their Sanger and imputation based HLA types were the reference data with which we set up the analysis method.
Next, we ran the afore-mentioned perfect mapping count approach on the heterozygous samples. When we sorted all alleles by the number of mapped reads (descending), we found allele A at the top of the list followed by many alleles that showed high similarity to that allele A. Allele B was very rarely ranked second place. Therefore, additional parameters had to be established to identify the correct allele combination for the sample's genotype. We ended up with 5 parameters:

First parameter
As described in the Methods section of the main manuscript we removed all redundancies regarding the mapped reads. Based on these results we calculated an ideal coverage, which reflects the maximal possible coverage. The area under the curve (AUC) was calculated as the fraction of ideal coverage that was achieved. The AUC parameter was then used instead of the number of perfect alignable reads. In Supplementary Methods Figure 1a we visualized the distribution of the AUC for all possible genotypes built from all fully covered alleles. The distribution for the reference genotypes is shown in blue. One can see that the AUC of the reference genotype is usually at the upper end of the distribution.

Second parameter
For heterozygous samples we assumed to cover the SNV positions of both DNA copies with about the same number of reads. Taking this into account we calculated the read equality (REQ) and counted the read mappings that mapped to only one of both alleles of a genotype. Then we divided the minimum of both by the maximum of both and received a value between 0 and 1 (1 stands for an exact even distribution). Supplementary Methods Figure 1b shows that the read equality of the reference genotypes is again close to the upper-side of the distribution. The zero values come from the homozygous genotypes.

Third parameter
A further review of the results showed that not only the read equality is a criteria that is maximized for heterozygous samples, but also the number of allele specific mappings per base (ASM). The ditribution of ASM is visualized in Supplementary Methods Figure 2a. The values of the reference genotypes are again located at the upper side of the destribution, but less than it is for the first two parameters (Supplementary Methods Figure 1). This observation already indicates a probable minor weighting of this parameter (compared to auc and req) when we try to find the genotype that has all parameters as high as possible later.

Fourth parameter
For the 4 th parameter we calculated the number of mapped pairs per read (MPPR) for each allele of every genotype. In case the input data comes from a paired end run, we assumed that the correct allele mapping shows as many paired end mappings as possible. The distribution of these values is shown in Supplementary Methods Figure 2b. The distribution again indicates a minor weighting of that parameter.

Fifth parameter
It also turned out that the IMGT/HLA database often only has sequence information for a reduced number of exons. These are, as expected, exons 2 & 3 for class I and exon 2 for class II genes. These short cDNA references have the tendency to show high values for allele specific mappings per base. On the other hand these short cDNAs cannot explain where many of the locus specific reads stem from. To correct for such instances, we introduced the length of the mappable sequence (MSL) as parameter five. For this parameter we assumed that the longer allele version is present if we detect the sequence for this allele. The distribution of these values is shown in Supplementary Methods Figure 2c and a minor weighting is again indicated.
After all parameters were calculated for all genotypes, they were scaled between zero and one for each sample and locus separate. If a value was by definition already scaled between zero and one this step was skipped (AUC, REQ). In the next step we calculated the harmonic mean for the rescaled parameters. Like the distribution of the values already indicated, some parameters might be weighted less. For ASM, MPPR and MSL we tested the weightings 1, 0.5 and 0.1. The best results for our validation set (Supplementary Methods Table 1) could be achieved with ASM weighted 0.5, MPPR weighted 0.1 and MSL weighted 0.1. The exact formula can be found in the Methods section of the main manuscript. Figure 1: Distribution of the parameter values that were used for the automated calling algorithm. a) Histograms of the area-under-the-curve and b) read-equality. For every sample and all loci we took all the alleles that were covered by 100%, following our mapping criteria described in the publication. For each locus we build all possible genotypes of the available alleles and calculated the parameters AUC (area under the curve) and read equality (see main paper). The values were sample-and locus-specific scaled between zero and one ([0;1]). The histograms for these values are shown here. The black framed bars were generated from the whole data set, the blue bars are derived from the values calculated for the reference alleles. We see for both parameters that these values are at the upper end of the distribution. The zero values for read equality stem from the homozygous genotypes. The panel between the x-axes and axis labels show vertical lines for each value that was used to to generate the histogram. Black lines are the values for the black framed histogram and blue for the blue histogram. Figure 2: Distribution of the parameter values that were used for the automated calling algorithm. Please see legend of Supplementary Methods Figure 1 for a description of the histogram type. The parameters visualized here are a) allele specific reads per base, b) mapped pairs per read and c) length of the mappable sequence. For the allele specific reads per base one can see a tendency that the reference genotypes are concentrated at the upper side of the distribution even in instances where it is not so clear as for the AUC and the read equality. For the score calculation we weigh allele specific reads per base with 0.5. For the remaining two parameters this effect is less strong and we weight those with 0.1 only. Table 1. Reference data set that was used for the implementation of the calling algorithm. The table shows the sample set used to develop the HLA calling algorithm. The first column listst the sample identifier, the second the HLA locus for which the following columns show the allele calls. Columns 3 and 4 show both alleles determined by the in silico SNP2HLA allelle imputation. Columns 5 and 6 show the classically Sanger determined genotype. Columns 7 and 8 show the HLA calling results of our tool after manual review of the NGS based calls. The fully automated calling that was based on our feature selection and weighting showed an overall 97% concordance to these results.

HLA imputation
Sanger NGS based caling