RAbHIT: R Antibody Haplotype Inference Tool

SUMMARY
Antibody haplotype inference (chromosomal phasing) may have clinical implications for the identification of genetic predispositions to diseases. Yet, our knowledge of the genomic loci encoding for the variable regions of the antibody is only partial, mostly due to the challenge of aligning short reads from genome sequencing to these highly repetitive loci. A powerful approach to infer the content of these loci relies on analyzing repertoires of rearranged V(D)J sequences. We present here RAbHIT, an R Haplotype Antibody Inference Tool, that implements a novel algorithm to infer V(D)J haplotypes by adapting a Bayesian framework. RAbHIT offers inference of haplotype and gene deletions. It may be applied to sequences from naïve and non-naïve B-cells, sequenced by different library preparation protocols.


AVAILABILITY
RAbHIT is freely available for academic use from comprehensive R archive network (CRAN) (https://cran.r-project.org/web/packages/rabhit/) under CC BY-SA 4.0 lisence.


SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.


Introduction
Analysis of antibody repertoires by high throughput sequencing is of major importance in understanding adaptive immune responses. Our knowledge of variations in the genomic loci encoding antibody genes is incomplete, mostly due to technical difficulties in aligning short reads to these highly repetitive loci. The partial knowledge results in conflicting V-D-J gene assignments between different algorithms, and biased genotype and haplotype inference. Previous studies have shown that haplotypes can be inferred by taking advantage of IGHJ6 heterozygosity, observed in approximately one third of the population [1], [2].
Here we provide a robust novel method for determining V-D-J haplotypes by adapting a Bayesian framework, RAbHIT. Our method extends haplotype inference to IGHD and IGHV based analysis, thereby enabling inference of complex genetic events like deletions and copy number variations in the entire population. It calculates a Bayes factor, a number that indicates the certainty level of the inference, for each haplotyped gene.
More details can be found here: Gidoni

Input
RAbHIT requires two main inputs: 1. Pre-processed antibody repertoire sequencing data with heterozygosity in at least one gene 2. Database of germline gene sequences Antibody repertoire sequencing data is in a data frame format. Each row represents a unique observation and columns represent data about that observation. The names of the required columns are provided below along with a short description.
Column Name Description An example dataset is provided with RAbHIT. It contains unique naive b-cell sequences, from multiple individuals. One individual in the example data set appears twice, once with full V coverage and once partial V coverage (I5 and I5_FR2 respectively).
The database of germline sequences should be provided in a FASTA format with sequences gaped according to the IMGT numbering scheme [3]. IGHV, IGHD, and IGHJ alleles in the IMGT database (release 2018-12-4) are provided with this package (HVGERM, HDGERM, and HJGERM). We removed the following IGHV and IGHD alleles, as they are duplicated from other allele assignments (All duplicates are mentioned in IMGT). Pre-processing of the data To obtain the most reliable result we suggest following the data pre-processing steps below.
3. Infer novel alleles (TIgGER [5]). 4. Re-align the sequences with the extended V reference including the new alleles. 5. Infer the individual genotype according to the following: • Cell type: -If the dataset is from PBMCs then, to avoid memory cell influence on the results, it is important to infer clones prior to genotyping. After inferring clones, choose a representative sequence with the fewest mutations for each clone. -If the dataset is from naive B-cells then no additional reduction is needed.
• V coverage: -If the dataset sequences are with partial V coverage, first determine reliable gene and alleles, then change the non-reliable alleles' annotation and genotype. -If the dataset is from naive B-cells then no additional reduction is needed.
6. Re-align the sequences with the individual personal genotype as a reference. 7. It is recommended to filter the sequences to only V genes with ≤ 3 mutations and no mutations in D gene. 8. For best haplotype inference results, ideally the dataset final unique VDJ count should be more than 2000 sequences.

Running RAbHIT
To load RAbHIT we'll run the following: The functions provided by this package can be used to perform any combination of the following:  Haplotype with D2-21 as anchor One of the advantages of RAbHIT, is the ability to infer haplotype by other anchor genes than J6. RAbHIT offers utilizing any gene as anchor, however, we recommend that the fraction of the least frequent allele of the anchor gene will be above 0.3. So far only J6, D2-8, and D2-21 has been proven to infer the heavy chain loci correctly. Here, we chose a single individual from our example dataset with heterozygousity in J6 and D2-21, inferred haplotypes according to both anchors, compared them using the hapDendo function, and calculated the Jaccard distance between the inferred haplotypes.
# change the anchor gene columns # For D2-21*01 and J6*03 we will change the column to Anchor_J03_D01 # For D2-21*02 and J6*02 we will change the column to Anchor_J02_D02

Inferring double chromosome deletion by relative gene usage
Gene usage tends to vary between individuals, and in some cases the relative gene usage of certain individuals are much lower than the rest of the population. To asses whether the low frequency arises from a deleted gene, a binomial test described in Gidoni et al. (2018) [6] was implemented.
There it was checked whether the relative gene usage of a specific gene or set of genes in an individual is lower than a chosen cutoff. For example for the IGHV genes, the chosen cutoff was 0.001. The deletionsByBinom function implements the binomial test and returns the detected gene deletion for a certain individual. The detection of a double chromosome deletion from the function can be then used for haplotype inference. This prior knowledge of deletion can raise the certainty level in the inference of the genes for where a deletion was detected. The createFullHaplotype receives a vector of the deleted genes detected. V gene labels marked in red represent low expressed genes for which deletions are inferred with lowly certainty. D gene labels marked in purple represent indistinguishable genes due to high sequence similarity, therefore the alignment call is less reliable.

Interactive Haplotype inference with double chromosome deletions
Individuals with partial V coverage tend to have multiple gene or allele assignments. This multiplicity requires special attention, as it can hinder the haplotype inference results by introducing biases. To infer the haplotypes correctly from those datasets we recommend following the initial pre-processing steps and detecting non reliable V genes using the nonReliableVGenes function.
The individual in the example dataset with the partial V coverage is I5_FR2. We shortened the sequences of individual I5 using pRESTO [4] MaskPrimers function with BIOMED-2 framework 2 (FR2) primers.
# Selecting a single individual with partial V coverage clip_db <-samples_db[samples_db$subject=='I5_FR2', ] # Detecting non reliable genes nonReliable_Vgenes <-nonReliableVGenes(clip_db) As mentioned above the double chromosome deletions can be used within the haplotype inference. For the partial V coverage dataset it is important to input the deletionsByBinom with the detect non reliable V gene list.
# Inferred deletion summary table del_binom_db <-deletionsByBinom(clip_db, chain = "IGH", nonReliable_Vgenes = nonReliable_Vgenes) Next, the haplotype can be inferred inputting the deletion summary table to the deleted_genes flag and the non reliable V genes list to the nonRelaible_Vgenes flag in the createFullHaplotype function.

Haplotype inference deletion heatmap
createFullHaplotype can infer single chromosome deletion events by the threshold of the certainty level of the inference. The function utilizes the Bayes factor (K) obtained from the posterior probability of the inference. A gene deletion event is defined when the log 10 of the K value (lK = log 10 (K)) for an "unknown" gene is larger than the threshold of 3 (The default value of the kThreshDel flag).
For a group of individuals, the single chromosome deletions coupled with the double chromosome deletions can be visualized with the deletionHeatmap function. The function creates a heatmap of the deletions inferred and colors them by the certainty level (lK).

Inferring D/J single chromosome deletion by V pooled approach
Since V gene heterozygosity is extremely common, using V genes as anchors for haplotype inference could increase the number of people for which D and J haplotype can be inferred. However, since there are far more V genes than J genes, the relative frequencies of them are lower. Therefore, to obtain a reliable haplotype inference using V genes as anchors, the sequence depth of the dataset has to be much greater than when using J6.
The RAbHIT package offers a solution to overcome the low number of sequences that connect a given V-D allele pair. RAbHIT applies an aggregation approach, in which connects information from several heterozygous V genes can be combined to infer D gene deletions. The deletionsByVpooled function uses the V pooled approach to detect single chromosomal deletions for D and J.