Polymorphisms in immunoglobulin heavy chain variable genes and their upstream regions

Germline variations in immunoglobulin genes influence the repertoire of B cell receptors and antibodies, and such polymorphisms may impact disease susceptibility. However, the knowledge of the genomic variation of the immunoglobulin loci is scarce. Here, we report 25 novel germline IGHV alleles as inferred from rearranged naïve B cell cDNA repertoires of 98 individuals. Thirteen novel alleles were selected for validation, out of which ten were successfully confirmed by targeted amplification and Sanger sequencing of non-B cell DNA. Moreover, we detected a high degree of variability upstream of the V-region in the 5’UTR, leader 1, and leader 2 sequences, and found that identical V-region alleles can differ in upstream sequences. Thus, we have identified a large genetic variation not only in the V-region but also in the upstream sequences of IGHV genes. Our findings challenge current approaches used for annotating immunoglobulin repertoire sequencing data.


21
identical V-region alleles can differ in upstream sequences. Thus, we have identified a large genetic 22 variation not only in the V-region but also in the upstream sequences of IGHV genes. Our findings 23 challenge current approaches used for annotating immunoglobulin repertoire sequencing data.

25
Immunoglobulins are an important part of the adaptive immune system. They exert their function 26 either as the antigen receptor of B cells that is essential for the antigen presentation capacity of these 27 cells, or as secreted antibodies that survey extracellular fluids of the body. Immunoglobulins can bind These loci remain incompletely characterized due to the fact that they contain many repetitive 35 sequence segments with many duplicated genes 4 , which makes it difficult to correctly assemble short 36 reads from whole genome sequencing. Single nucleotide polymorphisms as well as copy number 37 variations are in linkage disequilibrium and make up distinct haplotypes 4 . To this date, a limited 38 number of genomically sequenced 5-7 and inferred 8,9 haplotypes of the heavy chain and the two light 39 chain loci have been described. Different databases exist for genomic immune receptor DNA 40 sequences (IMGT/GENE-DB 10 ), putative novel variants from inferred data (IgPdb 11 ) or entire immune 41 receptor repertoires (OGRDB 12 ).

42
The usage of immunoglobulin heavy chain variable (IGHV) genes and their mutational status are most 43 frequently studied in relation to cancer 13,14 , responses to vaccines 15, 16 , or in autoimmune diseases [17][18][19] .   Here, we have used previously generated AIRR-seq data 30 from naïve B-cells of 98 Norwegian 56 individuals to identify novel IGHV alleles, a selection of which we then validated from genomic DNA 57 (gDNA) of non-B cells, i.e. T cells and monocytes. We also analyzed the sequences upstream of the 58 V-region, and constructed consensus sequences for the upstream variants present in the cohort.

59
These results expand our knowledge of this important locus and deepen our understanding of allelic 60 diversity within the Caucasian population. In addition, the result of this study can be used to improve 61 the accuracy of currently used bioinformatics tools for the analysis of immunoglobulin repertoire 62 sequencing data.

65
In this study, we used an AIRR-seq dataset from a cohort of 98 individuals 30 aiming to characterize 66 novel IGHV alleles that might be present in the data, as well as analyze the sequences upstream of 67 the V-region and create a table of previously unexplored upstream variants (Fig.1). To validate our 68 inferences from the AIRR-seq data analysis, genomic DNA of the same individuals was isolated from 69 non-B cells, i.e. T cells and monocytes. The reason for using non-B cells for validation was to avoid 70 capturing sequences with somatic hypermutation that occurs in primed B cells. IgDiscover. The availability of genomic DNA of the same individuals allowed us to verify some of our findings 76 from the analysis of the AIRR-seq data. Since the validation attempts revealed polymorphisms outside of the V-77 region, we decided to analyze the upstream sequences, i.e. 5'UTR, leader 1 and leader 2. We used a custom 78 approach for this analysis based on clustering identical variants. More details about the protocols and analysis 79 can be found in the methods section.

80
We used two germline inference tools, TIgGER 22,23 and IgDiscover 24 , to characterize novel alleles and 81 to create a personalized germline reference of IGHV alleles for each individual (aka genotype). The 82 purpose of using two different software tools was to increase our confidence in the inference of novel 83 alleles. This study does not aim to serve as a comparison of the available software tools.

84
To increase the overlap between the different software results and to allow the discovery of novel 85 alleles in genes with low expression, we adjusted selected TIgGER parameters, while keeping the 86 IgDiscover parameters as default. Suspected false positive signals were filtered out from the novel 87 allele candidates using mismatch frequency as described in Methods. The mismatch frequencies are 88 depicted in Supplementary Fig.1. Novel allele candidates that were determined to be false positives 89 contained mutations A152G, T154G and A85C ( Supplementary Fig.1).

92
We first analyzed the usage of all genes and the different alleles carried by individuals in the cohort.

97
We inferred altogether 25 novel alleles (Fig.2), and we named them using the closest reference allele.

98
The majority of the novel alleles (22) were identified both with TIgGER and IgDiscover. In addition to 99 these, two novel alleles were found exclusively by IgDiscover, and one novel allele was found 100 exclusively by TIgGER.

101
Thirteen novel alleles were selected for validation by targeted amplification and subsequent Sanger   Table   104 1. Out of those thirteen alleles, ten were successfully validated. These include IGHV1-2*02_G207T,     As some of the validated novel alleles had additional polymorphisms in the intron or the leader 132 sequence, we extended our analysis of the AIRR-seq data beyond the V-region. Although introns are 133 not present in the AIRR-seq data, the sequences of the 5' untranslated region (5'UTR), leader 1, and 134 leader 2 lie upstream of the V-region and are present in the data thanks to the library preparation 135 method (Fig.1). We will refer to 5'UTR, leader 1, and leader 2 collectively as upstream sequences.

136
We decided to use the genotyped AIRR-seq data to characterize upstream sequence variants for all 137 genes and alleles. To extract the upstream sequences, we removed the VDJ and constant regions,

148
According to the constructed germline reference dataset, the lengths of leader 1 (45 nt) and leader 2 149 (10 nt) sequences appear to be well conserved across most genes, with the exception of IGHV3-150 64*01 and IGHV6-1*01 (Fig.3). The leader 1 sequences of these two genes are 3 nt longer, which 151 makes the position of ATG appear to be shifted upstream. The length of the 5'UTR is relatively 152 conserved within the same gene family, however, there is a large variability across different families.

153
Genes of the IGHV2 family have the shortest 5'UTR, while the 5'UTRs of IGHV3 genes are the 154 longest.

155
Comparison of the consensus sequences in the cohort with the reference sequences obtained from 156 the IMGT/GENE-DB 10 revealed some discrepancies between our data and the reference database.

165
Consensus sequences constructed from clusters with less than 10 sequences or with relative frequency < 0.1 166 were excluded. Each row represents a consensus upstream sequence of a V allele with 5' to 3' orientation. The 167 colors of the tiles represent the different nucleotides. The coordinates on the x-axis describe the position of 168 each nucleotide relative to the start of the V-region (5' to 3') and are therefore labeled as negative numbers.

169
Alleles with more than one consensus sequence are marked with the allele name followed by an underscore

180
Using the sequences from the IMGT reference database, we determined the distance between the 181 ATG start codon and the reference or putative TATA-box. We found that this distance varied greatly 182 between different gene families. By comparing this distance to the 5'UTR length from the AIRR-seq 183 data, we observed that the distance between the ATG and the TATA-box correlated with the length of 184 the 5'UTR ( Supplementary Fig.7). Sequences with longer ATG to TATA-box distance had longer  polymorphism, and with no sequence corresponding to allele *05 being present (Fig.4b). Originally, 190 this allele was ambiguously annotated as deriving from either IGHV3-64*05 or IGHV3-64D*06, as it 191 differs by one nucleotide from each of these alleles (Fig.4a).

192
The upstream sequences of IGHV3-64 and IGHV3-64D differ all across their length, including the 193 5'UTR, leader 1, and leader 2 (Fig.3). The upstream regions of the novel allele IGHV3-64*05_G265C 194 are identical to those of IGHV3-64D, which indicated that this is indeed an allele of IGHV3-64D and 195 not IGHV3-64. Therefore, we decided to amplify the gene IGHV3-64D using primers specific to the 196 duplicated gene only. This resulted in the novel allele being finally validated (Fig.4c). Upon obtaining 197 the full germline sequence of the novel allele, we observed that its intron matched the one of IGHV3-198 64D and not IGHV3-64 (Fig.4d).

199
The genes IGHV3-43 and IGHV3-43D are another example of duplicated genes with differences in

229
We faced several issues with missing or incomplete genomic reference sequences, which 230 complicated the design of efficient primers for verification of novel alleles. Some of our validation 231 attempts were unsuccessful resulting only in the amplification of a "wild-type" allele without a 232 polymorphism. We suspect this might be caused by allelic dropout 33,34 . As we show in our upstream 233 sequence overview (Fig.4) downstream errors in analysis. In our study, the novel allele originally annotated as IGHV3-243 64*05_G265C was later found to be derived from the gene IGHV3-64D, located on a different part of 244 the IGHV locus than IGHV3-64. As previously shown 4,5,9 , IGHV3-64D is likely a part of an alternative 245 haplotype, since it was found to be deleted in many individuals, even in this cohort 30 . These two 246 genes differ in their upstream sequences, and thanks to this distinction, we were able to correctly 247 assign the novel allele to IGHV3-64D and validate it from gDNA.

248
Our results demonstrate that polymorphisms in the upstream regions can be utilized to improve 249 annotation methods presently employed. Having said that, the genetic variation in the sequences

275
Primers for validation were designed by PrimerBLAST using the reference genome as a template.

302
The resulting sequences were trimmed to remove the vector and primer sequences. V-gene

307
The sequences were named based on the amplified gene, followed by the closest reference allele 308 and the V-region polymorphism, which was determined by IMGT V-Quest 45 or by manual annotation 309 (in cases of ambiguous annotation).

310
The gDNA sequences of validated novel alleles were submitted to GenBank and subsequently to 311 IMGT.

313
The AIRR-seq data was pre-processed as described originally 30 using pRESTO 46 . Two individuals 314 were excluded from the analysis due to low sequencing depth (<2000).

315
Novel allele discovery and genotype inference 316 Genotype inference and novel allele discovery was also performed by TIgGER v 0.3.1 and IgDiscover 317 v0.11. The pre-processed sequences were annotated by IgBLAST 1.14.0 47 with modified parameters, 318 and the IMGT germline database (24) from January 2019 was used as a reference. The results of 319 alignment and genotype inference by TIgGER were processed using a similar pipeline to the one 320 used in http://www.vdjbase.org with slight modifications.

321
We experienced that the default settings resulted in incorrect annotation for some genes. This was 322 particularly obvious for the allele IGHV5-51*03, which was incorrectly annotated as IGHV5-51*01 with 323 one mutation C45G, corresponding to the already known allele *03. These two alleles differ only by 324 one nucleotide, and it was the length of the reference allele that seemed to affect whether or not the 325 sequence was correctly annotated by IgBLAST. The reference for *03 is 2 nt shorter than the 326 reference sequence for *01, while sequences in our data corresponding to IGHV5-51*03 were 327 matching the length of allele *01. Adjusting the IgBLAST parameters --reward to 0 and --penalty to -3 328 resolved this annotation problem. These parameters were also induced manually in IgDiscover 329 alignment step.

330
For novel allele detection we tested the parameters of the TIgGER function "findNovelAlleles": 1)

370
Since the length of the 5'UTR sequences of the same gene in AIRR-seq data can vary due to whole 371 VDJ sequence length and sequencing length limitations, we needed to determine where to trim the 372 longer sequences. To do this, we first filled the ends of sequences with Ns to match the length of the 373 longest sequence for the respective gene. We then trimmed all sequences after the first position, at 374 which 95% sequences contained N.

375
After that, for each allele and for each individual, we removed all artificially added Ns. Next, we 376 estimated sequence lengths, and lengths with frequency above 0.1 were considered frequent.

377
Sequences shorter than the shortest frequent sequence length were filtered out and sequences 378 longer than the longest frequent sequence length were trimmed to match its length. By applying

411
The pipeline for novel allele discovery and genotype processing using the software tools TIgGER and

432
We would like to thank Knut E. A. Lundin for coordinating collection of blood samples of participating 433 subjects and for being responsible for the ethical approval for the project. We also thank Victor Greiff 434 for discussions and helpful advice; and Marie K. Johannesen and Bjørg Simonsen for technical 435 assistance. We would also like to express our gratitude to all study participants.