nanoBERT: a deep learning model for gene agnostic navigation of the nanobody mutational space

Abstract
Motivation: Nanobodies are a subclass of immunoglobulins whose binding site consists of only one peptide chain, bestowing favorable biophysical properties. Recently, the first nanobody therapy was approved, paving the way for further clinical applications of this antibody format. Further development of nanobody-based therapeutics could be streamlined by computational methods. One such method is infilling: positional prediction of biologically feasible mutations in nanobodies. Being able to identify possible positional substitutions based on sequence context facilitates the functional design of such molecules.
Results: Here we present nanoBERT, a nanobody-specific transformer that predicts the amino acid in a given position of a query sequence. We demonstrate the need for such a machine learning-based protocol, as opposed to gene-specific positional statistics, since an appropriate genetic reference is not available. We benchmark nanoBERT against human-based language models and ESM-2, demonstrating the benefit of domain-specific language models. We also demonstrate the benefit of employing nanobody-specific predictions for fine-tuning on an experimentally measured thermostability dataset. We hope that nanoBERT will help engineers in a range of predictive tasks for designing therapeutic nanobodies.
Availability and implementation: https://huggingface.co/NaturalAntibody/.


Introduction
Nanobodies (also called VHHs or single-domain antibodies) are a class of immunoglobulins with binding sites consisting of only one polypeptide chain, as opposed to two in canonical human antibodies. The compact format bestows favorable biophysical properties, including high stability, greater tissue penetration and others (Flajnik et al. 2011, Bannas et al. 2017). The recent approval of the first nanobody therapeutic proved the feasibility of this format in human medicine (Morrison 2019), further reflected in increasing patenting activity around nanobodies (Krawczyk et al. 2021).
Design of novel therapeutic nanobodies can be facilitated by computational methods (Wilman et al. 2022), such as structural modeling (Cohen et al. 2022, Abanades et al. 2023) or deimmunization/humanization (Sang et al. 2021, Ramon et al. 2024). Though cognate to antibodies, nanobodies bear a range of properties that set them apart from canonical antibodies (Li et al. 2016, Mitchell and Colwell 2018a,b, Gordon et al. 2023). For this reason, nanobodies can benefit from the breadth of computational approaches developed for antibodies but require fine-tuning towards this specific molecule type (Norman et al. 2020, Wilman et al. 2022).
Computationally guided antibody engineering requires a mutational map of the candidate molecule indicating feasible and non-feasible mutations to help in activities such as humanization, affinity design, liability removal and others (Sang et al. 2021). Such predictions are currently possible owing to the large number of Next-Generation Sequencing (NGS) samples deposited in the public domain, delineating the biologically acceptable mutations (Kovaltsuk et al. 2018, Olsen et al. 2022a, Briney 2023).
Mutational maps of antibodies can be created as position-specific scoring matrices (PSSMs) (Smith et al. 2023), or as multiple sequence alignments (MSAs) of millions of NGS sequences sharing a single germline origin (Schmitz et al. 2020, Młokosiewicz et al. 2022). Other studies have demonstrated the feasibility of gene-agnostic language models of the human antibody space (Ruffolo et al. 2021, Shuai et al. 2021, Leem et al. 2022, Olsen et al. 2022b). Here, large transformer models (Vaswani et al. 2017, Devlin et al. 2019) are tasked with predicting obscured residues. Such transformer-based approaches have the advantage over PSSM/MSA-based approaches in that they offer mutational predictions taking the entire context of the sequence into account rather than single positions.
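As an illustration of the per-position character of the first approach, a minimal PSSM can be built by counting column frequencies in a gapless alignment; the toy sequences and pseudocount below are illustrative, not taken from the cited studies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Column-wise amino acid frequencies from a gapless alignment.

    Returns one dict per position mapping each amino acid to its
    smoothed relative frequency.
    """
    length = len(aligned_seqs[0])
    assert all(len(s) == length for s in aligned_seqs)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs)
        total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: (counts.get(aa, 0) + pseudocount) / total
                     for aa in AMINO_ACIDS})
    return pssm

# Each column is scored independently: unlike a transformer, the
# matrix cannot condition a prediction on the rest of the sequence.
toy = ["QVQLV", "QVKLV", "QVQLL"]
pssm = build_pssm(toy)
best = max(pssm[2], key=pssm[2].get)  # most frequent residue at position 3
```

The independence of columns is exactly the limitation that motivates the context-aware transformer approach discussed above.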
The specific case of infilling nanobody sequences could benefit from previous approaches developed for antibodies, either PSSM/MSA-based or transformer-based. Though there exist reliable germline gene references for humans (Lefranc et al. 1999, Smakaj et al. 2020), the camelid nanobody reference is not complete to the same extent (Tu et al. 2020). In addition, the highly diverse germline segments, extended mutation hotspot regions and increased hypermutation frequency in VHHs pose additional challenges for computational modeling (Nguyen et al. 2000). Without gene assignment, the creation of reliable PSSMs/MSAs is challenging. Therefore, between the PSSM/MSA and transformer approaches, only the latter remains a feasible way of offering a mutational guide for nanobodies.
To address this issue, here we present nanoBERT, a Bidirectional Encoder Representations from Transformers (BERT) model trained on ten million NGS sequences from the Integrated Nanobody Database for Immunoinformatics (INDI) (Deszyński et al. 2022). Our model side-steps the need for reliable nanobody gene assignments, offering a nanobody-specific mutational map of these molecules.

Datasets employed
The training set for nanoBERT was compiled as the 10 million non-redundant next-generation sequencing nanobody sequences from INDI. A total of 100 000 sequences were left as a validation set. We did not set aside a test set from this dataset, as we believed that it would not be sound, and instead opted for an entirely independent dataset (Table 1). Germline assignment of nanobody sequences was performed using ANARCI (Dunbar and Deane 2016).
We used four test sets for infilling benchmarking (Table 1). As the blind test set we employed an internal NaturalAntibody dataset. The dataset comes from two llamas, whose repertoires were sequenced to approximately 500 000 sequences. This dataset does not form part of any public dataset, and thus there should be no data leakage between test and train. For computational expediency, a set of 1000 nanobodies was sampled to constitute a natural test set. As the therapeutic test set we compiled a list of 18 therapeutic nanobodies from public sources (Raybould et al. 2020, Deszyński et al. 2022). These were Caplacizumab, Enristomig, Envafolimab, Gefurulimab, Gontivimab, Isecarosmab, Letolizumab, Lunsekimig, Ozekibart, Ozoralizumab, Porustobart, Rimteravimab, Sonelokimab, Tarperprumig and Vobarilizumab. Sonelokimab and Lunsekimig were multivalent, contributing three and two sequences, respectively. This dataset was employed to indicate how well a model trained on natural sequences reconstructs therapeutic nanobody sequences. The mouse dataset was employed to contrast how well the models distinguish between different organisms through nativeness calculation. Here, we sampled a set of 1000 non-redundant full V-region mouse sequences from OAS (Kovaltsuk et al. 2018).
For the fine-tuning experiments we employed the NbThermo dataset (Valdés-Tresanco et al. 2023), which consists of nanobody sequences associated with thermostability measurements (Table 1).

Transformer models employed
We trained a machine learning model based on the BERT paradigm following previous protocols, namely those of AntiBERTa (Leem et al. 2022) and AntiBERTy (Ruffolo et al. 2021). The objective task of our models was masked language modeling of 10 million nanobody sequences from INDI (Table 1). We created two models, nanoBERT_big and nanoBERT_small. The nanoBERT_big model closely resembled the architecture of AntiBERTa, with 86 million parameters and an embedding size of 768. For comparison, we also created nanoBERT_small, with 14 million parameters and an embedding size of 320, to check whether a smaller, more computationally efficient model would be comparable in performance to the bigger one created using a standardized protocol.
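The masked-language-modeling objective can be sketched as follows; the 15% masking rate and [MASK] token follow the generic BERT recipe and are assumptions for illustration rather than the exact training configuration used for nanoBERT.

```python
import random

def mask_sequence(seq, mask_rate=0.15, mask_token="[MASK]", rng=None):
    """Hide a random subset of residues; the model must reconstruct them.

    Returns the masked token list and the {position: original residue}
    targets used to compute the training loss.
    """
    rng = rng or random.Random(0)
    tokens = list(seq)  # each residue is a "word", the sequence a "sentence"
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    targets = {p: tokens[p] for p in positions}
    for p in positions:
        tokens[p] = mask_token
    return tokens, targets

# Toy framework-1 fragment; the model is trained to predict the
# hidden residues from both left and right context.
tokens, targets = mask_sequence("QVQLVESGGGLVQAGGSLRL")
```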
For comparison of nanobody and human antibody models, we employed three language models trained solely on human data. We trained two heavy-chain human models, human_320 and human_640, with 14 million and 160 million parameters, respectively. The human_320 model resembles nanoBERT_small in terms of its architecture but is trained on 25 million human heavy chains. The human_640 model is used to check whether scaling the number of parameters on the same dataset would achieve better prediction on nanobodies. As an external reference, we also employed AbLang_heavy, a publicly available language model trained on human antibody sequences with the goal of sequence infilling. We note that there exist language models with inbuilt nanobody capacity, namely IgLM (Shuai et al. 2021) and AbNativ (Ramon et al. 2024), but we did not compare infilling against these two. The former performs sequence generation, so it is not deterministic, making comparison unsound. To the best of our knowledge, AbNativ provides a single sequence score indicating nativeness, without sequence infilling.
Finally, to compare the nanobody-specific language models to a state-of-the-art protein language model not focused on a particular protein type, we employed ESM-2. We used the version with 650 million parameters (facebook/esm2_t33_650M_UR50D), the largest model we could compare head-to-head given our technical capacity.

Fine-tuning experiments
We performed fine-tuning experiments on nativeness and thermostability calculations.
For nativeness, we employed the test-set data from Table 1 to create two datasets: human vs nanobodies and mouse, as well as nanobodies vs human and mouse. Each dataset was then split in a proportion of 8:1:1. Fine-tuning consisted of adding a four-layer dense network with a sigmoid output and binary cross-entropy loss.
In the case of thermostability, we employed the NbThermo dataset (Table 1), splitting it between Circular Dichroism and DSF (SYPRO), as well as putting all the data in one set. The thermostability values were mapped to the range [0, 1] and the same four-layer dense network was employed for a regression task, but with a linear output and mean squared error loss.
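The fine-tuning heads described above can be sketched as a forward pass; the hidden-layer widths, ReLU activations and random weights below are illustrative placeholders (the text specifies only a four-layer dense network), and the melting-temperature bounds are assumed values for the [0, 1] mapping.

```python
import math
import random

def dense_head(embedding, layer_sizes=(64, 32, 16, 1), output="sigmoid", seed=0):
    """Forward pass of a small dense head on top of a frozen transformer
    embedding: sigmoid output for nativeness classification, linear
    output for [0, 1]-scaled thermostability regression.
    """
    rng = random.Random(seed)
    x = list(embedding)
    for i, width in enumerate(layer_sizes):
        # Random weights stand in for the fine-tuned parameters.
        w = [[rng.gauss(0, 1 / math.sqrt(len(x))) for _ in x]
             for _ in range(width)]
        x = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        if i < len(layer_sizes) - 1:
            x = [max(0.0, v) for v in x]  # ReLU between hidden layers
    out = x[0]
    return 1 / (1 + math.exp(-out)) if output == "sigmoid" else out

def scale_tm(tm, tm_min=40.0, tm_max=90.0):
    """Min-max map a melting temperature onto [0, 1] for regression;
    the bounds are assumed, not taken from NbThermo."""
    return (tm - tm_min) / (tm_max - tm_min)
```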

Availability
We trained and benchmarked the models within the Hugging Face framework. We make our best nanobody model (nanoBERT_small) and a reference human heavy-chain model (human_320) available at https://huggingface.co/naturalantibody/. The model documentation contains a Python notebook that can be easily run in Google Colab to demonstrate nanobody sequence infilling. The models can also be easily cloned from Hugging Face for custom applications such as fine-tuning, standardized within this framework.

Current camelid germlines are not sufficient to create reliable position specific scoring matrices
Before turning to language modeling, we tested the possibility of mapping the nanobody space by clustering it into genes, similarly to previous work on canonical antibodies (Młokosiewicz et al. 2022). Genes were assigned to the full INDI database using ANARCI (Dunbar and Deane 2016). A total of ~94% of the database was assigned to camelid genes, ~6% as human, and less than 0.01% as "cow," "mouse," "pig," "rabbit," and "rhesus." The chains that were recognized as camelid were unequally distributed between five V alleles (see Table 2). The very uneven assignment of germlines suggested poor gene identification, and we proceeded to evaluate the assignment by plotting the clonal distribution.
We assumed that the framework portions of the sequence should undergo the least somatic hypermutation, and thus should have the minimal distance from the assigned germline. As the framework portions we employed frameworks 2 (FW2) and 3 (FW3) of each chain, as they are present in most sequences (framework 1 is sometimes truncated in NGS). Chains without data in FW2 and FW3 were excluded. For each allele we identified the most frequent concatenated FW2+FW3 sequence. Assuming accurate gene assignment, one would expect the germline FW2+FW3 sequence to be the most abundant among all sequences aligned to the same gene. We then ordered all other FW2+FW3 clones by their distance to the assigned germline sequence and plotted their frequency (see Fig. 1).
IGHV4S1*01 was omitted due to the low number of chains clustered to the gene (Table 2).
Assuming correct gene assignment, it is plausible to expect the germline FW2+FW3 to be the centroid, with the slopes descending in Fig. 1. The fact that there appear to be regions of higher density farther away from the germline suggests sampling issues or even the existence of unmapped genes in the data. Studying the nature of these is beyond the scope of this study and could be tackled by using the NGS datasets to identify novel germlines (Ralph and Matsen 2019). In conclusion, for our infilling efforts, we were not convinced that creating a model using germline-based PSSMs/MSAs on the current data would be sound.
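The clonal-distribution check above amounts to a histogram of clone frequency over edit distance from the putative germline; a minimal sketch with plain Levenshtein distance is given below, using an invented germline fragment and clone counts rather than real INDI data.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def frequency_by_distance(clones, germline_fw):
    """Total clone frequency at each edit distance from the germline
    FW2+FW3 sequence; under correct gene assignment the mass should
    decay monotonically away from distance zero."""
    hist = Counter()
    for fw_seq, count in clones.items():
        hist[levenshtein(fw_seq, germline_fw)] += count
    return dict(sorted(hist.items()))

# Invented FW fragment and clone counts, purely for illustration.
germline = "WFRQAPGK"
clones = {"WFRQAPGK": 120, "WFRQAPGR": 30, "WYRQAPGR": 15}
hist = frequency_by_distance(clones, germline)
```

A secondary mode in this histogram, away from distance zero, is the signature of mis-assignment or unmapped germlines described above.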

Germline-agnostic transformer-based infilling of nanobody sequences
Since we were unable to convincingly map nanobody NGS diversity using germline paradigms, we developed a transformer-based model that should be more tolerant to data not being structured by genes. Specifically, we used BERT, a neural network architecture that enables representation of languages and word prediction from both right and left context (Devlin et al. 2019). In nanoBERT, each residue is considered a word and each sequence a sentence. nanoBERT was built using the same BERT architecture as AntiBERTa (Leem et al. 2022), designed for human antibodies. We developed two models, nanoBERT_big (86 million parameters) and nanoBERT_small (14 million parameters), to check whether a more computationally efficient model performs on par with the larger one.
We benchmarked the nanoBERT models against human-specific models and a protein-generalist model on three datasets: natural nanobodies, natural human antibodies and therapeutic nanobodies (Table 3). Individual positions in each test-set sequence were obscured and the model was tasked with reconstructing them. If the top prediction matched the original residue, it was counted as a match, and a mismatch otherwise. The accuracy for each region was calculated as the number of matches in that region divided by its length.
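The per-region accuracy measure can be sketched as below; the sequences and region boundaries are made up for illustration and do not correspond to real IMGT coordinates.

```python
def region_accuracy(true_seq, predicted_seq, regions):
    """Fraction of positions where the model's top prediction matches
    the original residue, reported per region.

    `regions` maps a region name to a (start, end) slice of the sequence.
    """
    assert len(true_seq) == len(predicted_seq)
    scores = {}
    for name, (start, end) in regions.items():
        t, p = true_seq[start:end], predicted_seq[start:end]
        matches = sum(a == b for a, b in zip(t, p))
        scores[name] = matches / len(t)
    return scores

# Toy example with invented boundaries standing in for IMGT regions.
regions = {"FW1": (0, 8), "CDR1": (8, 12)}
scores = region_accuracy("QVQLVESGGTFS", "QVQLVESGGSFS", regions)
```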
Performance on the nanobody dataset was meant to contrast nanobody-specific models with nanobody-nonspecific ones. By symmetry, the test on human sequences was meant to reflect whether human-specific models would outperform the nanobody-specific models. The therapeutic test set was meant to indicate whether naturally sourced predictions are useful in identifying mutations in nanobodies for therapeutic applications in humans.
On the natural nanobody dataset, which consisted of a sample of 1000 sequences from the sequencing of two llamas, the nanoBERT models outperform all other models by a wide margin. The nanoBERT models achieve ca. 76% V-region reconstruction versus ca. 64% for the human models. The human antibody models perform better than ESM-2 (57.4% V-region accuracy) at nanobody infilling, reflecting the cognate nature of antibodies and nanobodies. The worst performance is noted for the CDR3, which is understandable as it is the most diverse region. Most notably, the small nanoBERT model appears to be firmly within the predictive range of the larger model while being more computationally efficient (thus we make the nanoBERT_small model available via Hugging Face).
Benchmarking the models on the human dataset reverses the trend, and now the human antibody models perform much better than the nanobody-specific models. The human-specific models achieve ca. 91% V-region reconstruction accuracy versus 61%-64% accuracy for the nanoBERT models. The accuracy of the nanoBERT models on the entire V-region is within the range of ESM-2 (62.5%). Therefore, the reflective case, of nanobody models performing better on cognate antibodies than a generalist protein model, does not hold. Of note, the human-specific models achieve a much better reconstruction rate on human sequences (ca. 91%) than nanoBERT on nanobodies (ca. 76%), which could be due to the larger, more diverse datasets used to train them. Finally, we tested all models on infilling the 18 therapeutic nanobody sequences. Here, the performance gap between nanoBERT and the human-specific models is much smaller than on the natural datasets. The nanoBERT models achieve better prediction on the entire V-region (ca. 77% for the nanoBERT models versus ca. 70-73% for the human-specific models). The biggest performance gap is noted for the CDR regions, with the nanoBERT models achieving ca. 45% accuracy versus 35% for the human-specific models. On the therapeutic set, all models outperformed ESM-2 by a wide margin.
Therefore, these results demonstrate that nanobody-specific transformer models provide a benefit in the objective task of sequence infilling, with application to engineering such molecules.

Single domain-based therapeutics employs nanobody hallmark residues
Application of infilling models to therapeutic nanobodies is an important problem, with potential for sequence liability removal within a biologically relevant space. A cognate application is humanization: making the molecule resemble the human amino acid distribution at certain positions (Sang et al. 2021). Simply grafting the nanobody CDRs onto human frameworks is not efficient, as binding specificity can be lost.
There is much evidence and data collected on murine-based deimmunization, where strategically placed substitutions maximize human nativeness without compromising the stability of the molecule (Tennenhouse et al. 2023). However, with only 18 nanobody therapeutics, comparable knowledge on these molecules is still being developed.
To check to what extent existing therapeutic nanobody molecules reflect their natural versus human amino acid distribution, we plotted multiple sequence alignments of the therapeutic nanobodies and the closest human germlines (Fig. 2). In all cases, most mismatches with the human germline accumulate in the framework 2 region. Eight out of eighteen nanobody therapeutics have the framework 2 hallmark motif FERF, as opposed to the typical human VGLW. Therefore, current engineering choices retain nanobody-preferred residues (Saerens et al. 2005), even though these are not preferred substitutions according to the human amino acid distribution.
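The hallmark comparison above can be sketched as a simple motif check. The residue positions below are 0-based stand-ins for the commonly cited Kabat FW2 hallmark sites (37, 44, 45, 47), which is an assumption for illustration, and the sequences are invented placeholders, not the actual therapeutics.

```python
# 0-based stand-ins for the commonly cited Kabat FW2 hallmark
# positions 37, 44, 45 and 47 -- an assumption for illustration.
HALLMARK_POSITIONS = (36, 43, 44, 46)

def fw2_motif(seq, positions=HALLMARK_POSITIONS):
    """Concatenate the residues found at the hallmark positions."""
    return "".join(seq[p] for p in positions)

def count_hallmark(sequences, motif="FERF"):
    """Count sequences retaining the nanobody hallmark motif (FERF)
    rather than the human-typical VGLW."""
    return sum(fw2_motif(s) == motif for s in sequences)

# Invented sequences: one with the camelid FERF hallmark, one with
# the human VGLW framework residues.
camelid_like = "X" * 36 + "F" + "X" * 6 + "ER" + "X" + "F"
human_like = "X" * 36 + "V" + "X" * 6 + "GL" + "X" + "W"
```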
Employing a nanobody-specific infilling model can better reflect the amino acid distribution preferred at such positions. Contrasting such distributions with human variable regions (Sang et al. 2021) or human transformers (e.g. human_320) might provide an initial data basis for humanization of these molecules whilst clinical-trial data on anti-drug antibodies to nanobodies is accumulated.

Downstream zero shot and fine-tuning prediction tasks
Self-supervised models can be applied to a range of problems in zero-shot fashion, predicting properties they were not specifically trained on (Meier et al. 2021). Here we focused on nanoBERT nativeness, which correlates aggregate positional residue predictions with the attribution of sequences to a given species (Wollacott et al. 2019).
We calculate nativeness as the sum of inverse exponents of the last layers of our nanoBERT_small and human_320 transformers. The lower the nativeness value, the closer the query sequence is to the sequence distribution the model was trained on. We contrasted it with the single nanobody-specific nativeness score available, namely AbNativ (Ramon et al. 2024), where nativeness is calculated using a variant of the variational autoencoder, built on either human or nanobody-based distributions. We contrast the nativeness scores of nanoBERT_small, human_320, AbNativ human and AbNativ nanobody on our human, nanobody and mouse test sets (Fig. 3).
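One minimal reading of the "sum of inverse exponents" score described above is sketched below; the per-position logits are toy numbers rather than real model output, and the exact aggregation used here may differ from ours in detail.

```python
import math

def nativeness(observed_logits):
    """Sum of exp(-logit) over the logit assigned to the residue
    actually observed at each position.

    High logits for the observed residues give a low sum, so lower
    scores indicate a sequence closer to the model's training
    distribution.
    """
    return sum(math.exp(-logit) for logit in observed_logits)

# Toy logits of the observed residue at five positions.
in_distribution = [4.0, 3.5, 5.0, 4.2, 3.8]
out_of_distribution = [0.5, 1.0, 0.2, 0.8, 0.6]
```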
It is clear that the human models do a much better job of distinguishing human sequences from non-human ones (Fig. 3A and C). By contrast, both nanobody models do quite poorly on zero-shot separation of our llama nanobodies from human or mouse ones (Fig. 3B and D). In both cases, the nanobody distribution is fully mixed with the human one. It cannot be the case that there exist human sequences that are too similar to nanobody ones to be distinguished, as otherwise the human-only models would likewise fail at separation. We further checked the difficulty of the sequence separation problem in a fine-tuning fashion, by training a four-layer dense head on top of either human_320 or nanoBERT_small. In each case, even training on a very small dataset (splitting our test sets in 8:1:1 fashion into human vs mouse and nanobody, or nanobody vs human and mouse) yielded ROC AUCs approaching 1.0.
Species separation is a trivial problem that can be solved by non-machine-learning methods using sequence alignment to the closest germlines (Table 2). To test our transformers on a more challenging downstream task, we fine-tuned a predictor on the NbThermo dataset, a compilation of nanobody sequences associated with their melting temperatures (Valdés-Tresanco et al. 2023). The entire dataset we employed consisted of 470 sequences, with the Circular Dichroism and DSF (SYPRO) measurements having the most sequences, 263 and 240, respectively.
We split the fine-tuning into three cases: first, training and testing on Circular Dichroism; second, training and testing on DSF (SYPRO); and finally, training and testing on all the measurements. Sequences were split into train and test sets in 8:1:1 fashion and a 90% sequence identity restriction was imposed. The melting temperatures were mapped to the range [0, 1], and predicting the value in this range was the objective task of the head built on top of the transformer.

Figure 3. Nativeness calculation based on nanobody and human models. Nativeness was calculated using our nanoBERT_small and human_320 models and contrasted with the AbNativ VHH (nanobody) and VH (antibody) models. Please note that the nativeness scales are not comparable between AbNativ and our models: in the case of AbNativ, higher values are more native, whereas for nanoBERT_small and human_320, lower values indicate higher nativeness. (A) Nativeness calculated using human_320. (B) Nativeness calculated using nanoBERT_small. (C) Nativeness calculated using the AbNativ VH score. (D) Nativeness calculated using the AbNativ VHH score.
The models were applied to three splits of the NbThermo dataset. In each case, the model was fine-tuned five times and the Pearson correlation calculated on the test set. Multiple training sessions were used to check the stability of training on such a small dataset. As a comparison to random performance on a given split, we also calculated a random baseline, sampling prediction scores from a uniform random distribution in the range [0, 1]. To show the effect of pre-training, we also trained the same model architecture without prior pre-training. The results of the fine-tuning experiments are given in Table 4. Both the nanobody and human models outperform the random baseline and the model without pre-training. The nanobody-specific transformer appears to perform better on the DSF (SYPRO) dataset and marginally worse on the combined dataset. The human transformer achieves very poor performance on the DSF (SYPRO) dataset. Altogether, the results suggest that there might be some benefit in employing a nanobody-specific model for fine-tuning on nanobody-specific tasks. Nevertheless, how generalizable such models are remains to be verified on larger datasets.
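The evaluation above repeats fine-tuning and reports a Pearson correlation on the held-out split, alongside a uniform random baseline; both pieces can be sketched self-containedly as below (the target values are invented stand-ins for scaled melting temperatures).

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def random_baseline(targets, n_runs=5, seed=0):
    """Mean Pearson correlation of uniform [0, 1] predictions against
    the targets, mirroring the random baseline used in Table 4."""
    rng = random.Random(seed)
    rs = []
    for _ in range(n_runs):
        preds = [rng.random() for _ in targets]
        rs.append(pearson(preds, targets))
    return sum(rs) / n_runs

# Invented [0, 1]-scaled melting-temperature targets.
targets = [0.1, 0.3, 0.5, 0.7, 0.9, 0.2, 0.8, 0.4]
```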

Discussion
We demonstrated that a nanobody-specific transformer outperforms those trained on generalist protein data and those focused on human heavy chains. Such infilling models could be instrumental in providing mutational choices in the process of engineering these molecules.
The clearest application of infilling is humanizing nanobodies, which is already being addressed by experimental (Saerens et al. 2005) and machine learning protocols (Sang et al. 2021). Choosing substitutions so as to best reflect the human distribution, whilst maintaining the single-domain character of a nanobody therapeutic, can be challenging. Tapping into a model of the human and nanobody distributions could indicate positions that are strongly preferred in one or the other, offering evidence for or against certain substitutions. Any benefits of such data-driven approaches will be properly benchmarked when more immunogenicity data on nanobodies becomes available with future clinical trials.
Beyond infilling, applying models self-supervised on large sequence datasets holds much potential for focusing them on much smaller experimental datasets. Though we have demonstrated the benefits of using nanobody-specific datasets for fine-tuning, a much wider benchmark encompassing multiple nanobody-specific experimental datasets would be necessary.
We make our human and nanobody models available via Hugging Face in the hope that they will facilitate multiple therapeutic applications, both within the scope of sequence infilling and fine-tuning.

Figure 1. Frequency of nanobody sequences by edit distance to FW2+FW3 germline sequences. Blue circles indicate the total frequency of all chains at a given distance to the germline FW2+FW3 sequence. Red crosses indicate the frequency of the most frequent unique FW2+FW3 sequence at a given edit distance. Most unique FW2+FW3 sequences are not sufficient to explain most of the non-zero edit distance variability.

Figure 2. Multiple sequence alignment of nanobody-based therapeutics. For each nanobody-based therapeutic, the closest human germline was identified. The graft of the IMGT CDRs is shown along with mismatches between the closest human framework and the original therapeutic. Nanobody hallmark residues are annotated in yellow.

Table 1. Datasets employed for training nanoBERT and benchmarking it against other methods.

nanoBERT training set: 10 000 000 naturally sourced nanobodies compiled from public sources (Deszyński et al. 2022)
Therapeutic nanobody test set: 18 nanobodies developed for therapeutic applications
Natural nanobody test set: 1000 naturally sourced llama nanobodies, not part of any public dataset
Mouse heavy chain test set: 1000 mouse sequences sampled from diverse sources (Kovaltsuk et al. 2018)
Human heavy chain test set: 1000 sequences sampled from a recent human NGS dataset (Jaffe et al. 2022)
NbThermo dataset: 470 nanobody sequences associated with thermostability values (Valdés-Tresanco et al. 2023)

All sequences are non-redundant on the V-region level.

Table 2. Camelid V gene frequency of the INDI database assigned by ANARCI.

Table 3. Mean positional single amino acid prediction accuracy by IMGT-defined region. We count the accuracy of the query sequence matching the top prediction from a given language model. The value for the best-performing model is given in bold.