On the effectiveness of compact biomedical transformers

Abstract

Motivation: Language models pre-trained on biomedical corpora, such as BioBERT, have recently shown promising results on downstream biomedical tasks. Many existing pre-trained models, on the other hand, are resource-intensive and computationally heavy owing to factors such as embedding size, hidden dimension, and number of layers. The natural language processing community has developed numerous strategies to compress these models using techniques such as pruning, quantisation, and knowledge distillation, resulting in models that are considerably faster, smaller, and subsequently easier to use in practice. By the same token, in this article we introduce six lightweight models, namely BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT, and CompactBioBERT, which are obtained either by knowledge distillation from a biomedical teacher or by continual learning on the PubMed dataset. We evaluate all of our models on three biomedical tasks and compare them with BioBERT-v1.1, with the aim of creating efficient lightweight models that perform on par with their larger counterparts.

Results: We trained six different models in total, with the largest having 65 million parameters and the smallest 15 million; a far lower range of parameters compared with BioBERT's 110M. Based on our experiments on three different biomedical tasks, we found that models distilled from a biomedical teacher and models that have been additionally pre-trained on the PubMed dataset can retain up to 98.8% and 98.6% of the performance of BioBERT-v1.1, respectively. Overall, our best model below 30M parameters is BioMobileBERT, while our best models over 30M parameters are DistilBioBERT and CompactBioBERT, which retain up to 98.2% and 98.8% of the performance of BioBERT-v1.1, respectively.

Availability and implementation: Code is available at https://github.com/nlpie-research/Compact-Biomedical-Transformers. Trained models can be accessed at https://huggingface.co/nlpie.


Introduction
There has been an ever-increasing abundance of medical texts in recent years, both in private and public domains, which provide researchers with the opportunity to automatically process and extract useful information to help develop better diagnostic and analytic tools (Locke et al., 2021). Medical corpora can come in various forms, each with its own specific context. These include Electronic Health Records (EHR), medical texts on social media, online knowledge bases, and scientific literature (Kalyan and Sangeetha, 2020).
Recent advances in Natural Language Processing (NLP) and deep learning have made it possible to computationally process biomedical texts as varied as the above using powerful generic methods that learn text representations based on contextual information around each word. These methods alleviate the need for cumbersome feature engineering or extensive preprocessing, and when combined with the appropriate GPU technology, can handle large volumes of data with a high level of efficiency (Wu et al., 2020). Contextualised embeddings like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), while having been derived primarily using a generic language modelling objective, are able to capture task-agnostic and generalisable syntactic and semantic properties of words in their context, making them useful for various downstream applications (Ethayarajh, 2019; Tenney et al., 2019).

Figure 1:
The two general strategies proposed for training compact biomedical models. The first approach is to directly distil a compact model from a biomedical teacher, which in our work is BioBERT-v1.1. The distillation depicted in this figure is the same technique used for obtaining DistilBioBERT. TinyBioBERT and CompactBioBERT, on the other hand, employ different approaches, which are not shown here. The second method involves additionally pre-training a compact model on biomedical corpora. For this approach, we use compact models which have been distilled from powerful teachers, namely DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020), and MobileBERT (Sun et al., 2020).
In recent years, different 'probing' methods have been developed to study different aspects of word embeddings and understand their internal mechanics (Conneau et al., 2018; Jawahar et al., 2019; Clark et al., 2019). These studies have shown that BERT encapsulates a surprising amount of knowledge about the world and can be used to solve tasks that traditionally would require encoded information from knowledge bases (Rogers et al., 2020). These models are not without their own drawbacks and come with certain limitations. For example, it has been shown that BERT does not understand negation by default (Ettinger, 2020) and struggles with representations of numbers (Wallace et al., 2019). Regardless of these shortcomings, BERT and its different variants are still the state of the art in different areas of NLP.
With the advent of the transformer architecture (Vaswani et al., 2017), the NLP community has moved towards utilising pre-trained models that can be used as a strong baseline for different tasks and also serve as a backbone for other, more sophisticated models. The standard procedure is to use a general model pre-trained on a very large amount of unstructured text and then fine-tune the model, adapting it to the specific characteristics of each task. Most state-of-the-art NLP models are based on this procedure.
A related alternative to the standard pretrain-and-fine-tune approach is domain-adaptive pretraining, which has been shown to be effective on different textual domains. In this paradigm, instead of fine-tuning the pretrained model on the task-specific labelled data, pretraining continues on the unlabelled training set. This allows a smaller pretraining corpus, but one that is assumed to be more relevant to the final task (Gururangan et al., 2020). This method is also known as continual learning, which refers to the idea of incrementally training models on new streams of data while retaining prior knowledge (Parisi et al., 2019).
NLP researchers working with biomedical data have naturally started to incorporate these techniques into their models. Apart from vanilla fine-tuning on medical texts, specialised BERT-based models have also been developed that are specifically trained on medical and clinical corpora. ClinicalBERT (Huang et al., 2019), SciBERT (Beltagy et al., 2019a), and BioBERT (Lee et al., 2020) are successful attempts at developing pretrained models that are relevant to biomedical NLP tasks. They are regularly used in the literature to develop the latest best-performing models on a wide range of tasks.
Regardless of the successes of these architectures, their applicability is limited because of the large number of parameters they have and the amount of resources required to employ them in real settings. For this reason, there is a separate line of research in the literature on creating compressed versions of larger pretrained models with minimal performance loss. DistilBERT (Sanh et al., 2019), MobileBERT (Sun et al., 2020), and TinyBERT (Jiao et al., 2020) are prominent examples of such attempts, which aim to produce a lightweight version of BERT that closely mimics its performance while having significantly fewer trainable parameters. The process used in creating such models is called distillation (Hinton et al., 2015).
In this work, we first train three distilled versions of BioBERT-v1.1 using different distillation techniques, namely DistilBioBERT, CompactBioBERT, and TinyBioBERT. Following that, we pre-train three well-known compact models (DistilBERT, TinyBERT, and MobileBERT) on the PubMed dataset using continual learning. The resultant models are called BioDistilBERT, BioTinyBERT, and BioMobileBERT. Finally, we compare our models to BioBERT-v1.1 through a series of extensive experiments on a diverse set of biomedical datasets and tasks. The analyses show that our models are efficient compressed models that can be trained significantly faster and with far fewer parameters compared to their larger counterparts, with minimal performance drops on different biomedical tasks.
To the best of our knowledge, this is the first attempt to specifically focus on training compact models on biomedical corpora, and by making the models publicly available, we provide the community with a resource to implement powerful specialised models in an accessible fashion.
The contributions of this paper can be summarised as follows:
• We are the first to specifically focus on training compact biomedical models using distillation and continual learning.
• Utilising continual learning via the Masked Language Modelling (MLM) objective, we train three well-known pre-trained compact models, namely DistilBERT, MobileBERT, and TinyBERT for 200k steps on the PubMed dataset.
• We evaluate our models on a wide range of biomedical NLP tasks that include Named Entity Recognition (NER), Question Answering (QA), and Relation Extraction (RE).
• We make all of our six compact models freely available on Hugging Face and GitHub. These models cover a wide range of parameter sizes, from 15M parameters for the smallest model to 65M for the largest.

Background
Pre-training followed by fine-tuning has become a standard procedure in many areas of NLP and forms the backbone of most state-of-the-art models such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020). The goal of language model pre-training is to acquire effective in-context representations of words based on a large corpus of text, such as Wikipedia. This process is often self-supervised, which means that the representations are learned without using human-provided labels. There are two main strategies for self-supervised pre-training, namely MLM and Causal Language Modelling (CLM). In this work, we focus on models pre-trained with the MLM objective.

Masked Language Modeling
MLM is the process of randomly omitting portions of a given text and having the model predict the omitted portions. The masking percentage is normally 15%; of the selected tokens, 80% are substituted with a special mask token (e.g. [MASK]), 10% are replaced with a random word, and 10% are left unchanged (Devlin et al., 2019). Contextualised representations generated using these pre-trained language models are referred to as bidirectional, which means that information from both the previous and following context is used to construct the representation of each given word.
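To make the procedure concrete, the snippet below is a minimal PyTorch sketch of this masking scheme, not the exact implementation used in this work; the function name and arguments are illustrative, and special tokens are simply excluded from selection.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, p=0.15):
    """Illustrative BERT-style masking: select ~15% of tokens, then replace 80%
    of the selected tokens with [MASK], 10% with a random token, and leave 10%
    unchanged. Labels are -100 for unselected positions so they are ignored."""
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, p)
    for sid in special_ids:                       # never select special tokens
        probs[input_ids == sid] = 0.0
    selected = torch.bernoulli(probs).bool()
    labels[~selected] = -100

    # 80% of the selected tokens become [MASK].
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id
    # Half of the remaining selected tokens (10% overall) become a random token.
    random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[random] = torch.randint(vocab_size, input_ids.shape)[random]
    # The rest (10% overall) are left unchanged.
    return input_ids, labels
```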
MLM makes use of the distributional hypothesis, an idea introduced originally by Harris (1954) and later popularised by Firth (1957). The premise is that words that occur in the same contexts tend to have similar meanings, or as Firth phrased it, "a word is characterised by the company it keeps". As a result, BERT shares conceptual similarities with other representation learning schemes in NLP. There is strong evidence to suggest that MLM relies on distributional semantic information significantly more than on the grammatical structure of sentences (Sinha et al., 2021).

BERT: Bidirectional Encoder Representation from Transformers
The most prominent transformer pre-trained with MLM is BERT. BERT is an encoder-only transformer that relies on the multi-head attention mechanism for learning in-context representations.
BERT has different variants, such as BERT base and BERT large, which vary in the number of layers and the size of the hidden dimension. The original BERT is trained on the English Wikipedia and BooksCorpus datasets for about 1 million training steps, making it a strong model for various downstream NLP tasks.
Fine-tuning pre-trained BERT on a downstream task involves training the model for a few more epochs on a labelled dataset with a lower learning rate (Sun et al., 2019). It has been shown that, since this procedure only affects the weights in the top layers of BERT, it will not lead to catastrophic forgetting of linguistic information (Merchant et al., 2020).

BioBERT and other Biomedical Models
While generic pre-trained language models can perform reasonably well on a variety of downstream tasks, even in domains other than those on which they have been trained, in recent years researchers have shown that continual learning and pre-training of language models on domain-specific corpora lead to noticeable performance boosts compared to simple fine-tuning. BioBERT is an example of such a domain-specific BERT-based model and the first to be trained on biomedical corpora.
BioBERT takes its initial weights from BERT base (pre-trained on Wikipedia and BooksCorpus) and is further pre-trained using the MLM objective on the PubMed and, optionally, PMC datasets. BioBERT has shown promising performance on many biomedical tasks, including NER, RE, and QA. Aside from BioBERT, numerous additional models have been trained entirely or partially on biomedical data, including ClinicalBERT (Huang et al., 2019), SciBERT (Beltagy et al., 2019b), BioMedRoBERTa (Gururangan et al., 2020), and BioELECTRA (Kanakarajan et al., 2021).

Overparametrisation of Language Models
The BERT base model has 110M parameters, which is a modest number compared to T5 (11B), GPT-3 (175B), or MT-NLG (530B). Training models of this magnitude comes with considerable financial and environmental costs. This trend is unlikely to be reversed anytime soon given the increasing computational power and the resources that large technology companies devote to creating such models (Bender et al., 2021). Strubell et al. (2019) studied several major transformer-based models and estimated the carbon footprint and cloud compute costs incurred during their training. Warnings against environmentally unfriendly practices in AI and NLP research have created interest in the community in developing lighter but computationally efficient models that come with minimal reduction in performance. This trend has been described as 'Green AI' (Schwartz et al., 2020). Model compression can be considered a step in this direction. It is predicated on the idea of creating a quick and compact model to imitate a slower, bigger, but more performant model (Bucilua et al., 2006). Several different model compression methods exist, with the aim of encoding large models and creating smaller, more compact versions of them. The present work focuses on knowledge distillation, but we will also briefly mention quantisation and pruning.

Quantisation and Pruning
Quantisation is a technique that attempts to reduce the memory footprint of a pre-trained language model by reducing the precision of its weights, and it uses low-bit hardware operations to speed up computation (Shen et al., 2020). It is an effective method for model compression and acceleration that can be applied both to pre-trained models and to models trained from scratch (Cheng et al., 2017). This method requires hardware compatibility to function (Rogers et al., 2020).
Pruning is another model compression method that disables certain parts of a larger model to create a compressed, faster version of it. It has been shown that zeroing out different parts of the multi-head attention mechanism in BERT does not result in a significant performance drop at inference time (Michel et al., 2019). Pruning can be performed in a structured way, where certain components of the model are removed, or in an unstructured fashion, where weights are dropped regardless of their location in the network (Rogers et al., 2020). Since quantisation and pruning are developed independently and are complementary to each other, they can be used in tandem to develop a single compressed model.
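As an illustration of these two techniques (neither of which is used for the models released in this work), the sketch below applies PyTorch's post-training dynamic quantisation and the Transformers library's structured head pruning to a generic BERT checkpoint; the checkpoint name and the chosen layers and heads are arbitrary.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# Dynamic quantisation: store the weights of all Linear layers in int8 and use
# low-precision kernels at inference time (CPU inference in stock PyTorch).
quantised = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Structured pruning: remove selected attention heads from selected layers.
# The mapping is {layer index: [head indices to drop]}.
model.prune_heads({0: [0, 1], 11: [2, 3]})
```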

Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) is the process of transferring knowledge from a larger model called the "teacher" to a smaller one called the "student", using the larger model's outputs as soft labels. Distillation can be done in a task-specific way, where the pre-trained model is first fine-tuned on a task and the student then attempts to imitate the teacher network. This is an effective method; however, fine-tuning a pre-trained model can be computationally expensive. Task-agnostic distillation, on the other hand, allows the student to mimic the teacher by looking at its masked language predictions or intermediate representations. The student can subsequently be fine-tuned directly on the final task (Wang et al., 2020; Yao et al., 2021).
DistilBERT is a well-known example of a compressed model that uses knowledge distillation to transfer the knowledge within the BERT base model to a much smaller student network, which is about 40% smaller and 60% faster. It uses a triple loss, which is a linear combination of language modelling, distillation, and cosine-distance losses.

Approach
In this work, we focus on training compact transformers on biomedical corpora. Among the available compact models in the literature, we use the DistilBERT, MobileBERT, and TinyBERT models, which have shown promising results in NLP. We train compact models using two different techniques, as shown in Figure 1. The first is continual learning of pre-trained compact models on biomedical corpora. In this strategy, each model is further pre-trained on the PubMed dataset for 200k steps via the MLM objective. The obtained models are named BioDistilBERT, BioMobileBERT, and BioTinyBERT.
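A minimal sketch of this continual-learning strategy using the Hugging Face Trainer is shown below. The starting checkpoint, the local PubMed text file, the batch size, and the learning rate are placeholders rather than the exact configuration used in this work; only the 200k-step budget and the MLM objective are taken from the description above.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Start from an existing compact checkpoint and continue MLM pre-training.
name = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# "pubmed_abstracts.txt" stands in for a pre-processed PubMed abstracts corpus.
corpus = load_dataset("text", data_files={"train": "pubmed_abstracts.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="biodistilbert-continual",
    max_steps=200_000,                      # 200k steps, as described above
    per_device_train_batch_size=16,         # placeholder batch size
    learning_rate=5e-5,                     # placeholder learning rate
)
Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
```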
For the second strategy, we employ three distinct techniques: the DistilBERT and TinyBERT distillation processes, as well as a mixture of the two. The obtained models are named DistilBioBERT, TinyBioBERT, and CompactBioBERT. We test our models on three well-known biomedical tasks and compare them with BioBERT-v1.1, as shown in Tables 1 to 6.

Methods
In this section, we describe the internal architecture of each compact model that is explored in the paper, the method used to initialise its weights, and the distillation procedure employed to train it.

DistilBioBERT

Initialisation of the Student
Effective initialisation of the student model is critical due to the size of the model and the computational cost of distillation. As a result, there are numerous techniques available for initialising the student. One method, introduced by Turc et al. (2019), is to initialise the student via MLM pre-training and then perform distillation. Another approach, which we have followed in this work, is to take a subset of the larger model by using the same embedding weights and initialising the student from the teacher by taking weights from every other layer (Sanh et al., 2019). With this approach, the hidden dimension of the student is restricted to that of the teacher model.
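The sketch below illustrates this initialisation strategy for a 6-layer BERT-style student taken from a 12-layer teacher; the teacher checkpoint name is only an example, and whether its MLM head is included in the released weights is an assumption.

```python
from transformers import BertConfig, BertForMaskedLM

# Example teacher checkpoint; any 12-layer BERT-style biomedical model would do.
teacher = BertForMaskedLM.from_pretrained("dmis-lab/biobert-v1.1")

# Build a 6-layer student with the same hidden and embedding sizes as the teacher.
student_cfg = BertConfig.from_pretrained("dmis-lab/biobert-v1.1", num_hidden_layers=6)
student = BertForMaskedLM(student_cfg)

# Share the embedding weights and copy every other transformer layer (0, 2, ..., 10).
student.bert.embeddings.load_state_dict(teacher.bert.embeddings.state_dict())
for i, layer in enumerate(student.bert.encoder.layer):
    layer.load_state_dict(teacher.bert.encoder.layer[2 * i].state_dict())

# The MLM head can also be inherited, if the checkpoint provides one.
student.cls.load_state_dict(teacher.cls.state_dict())
```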

Distillation Procedure
For distillation, we mainly follow the work of Sanh et al. (2019), in which the loss is a combination of three different terms. In this section, we explain each of these in detail. The first term is the standard cross-entropy loss used for the MLM objective:

$$L_{mlm}(X, Y) = -\frac{1}{\sum_{n=1}^{N} W_n} \sum_{n=1}^{N} W_n \sum_{v=1}^{|V|} Y_{n,v} \log f_s(X)_{n,v}$$

where X is the input of the model; Y denotes the MLM labels, which is a collection of N one-hot vectors, each of size |V|, where |V| is the size of the vocabulary of the model and N is the number of input tokens; and W_n is 1 for masked tokens and 0 otherwise. This ensures that only masked tokens contribute to the computation of the loss. f_s represents the student model, whose output is a probability distribution vector of size |V| for each token.
The second loss term used for distillation is a KL divergence loss over the outputs (i.e. the soft labels) of the teacher model, where f_t represents the teacher:

$$L_{softMLM}(X) = \frac{1}{\sum_{n=1}^{N} W_n} \sum_{n=1}^{N} W_n \, \mathrm{KL}\left(f_t(X)_n \,\|\, f_s(X)_n\right)$$

Finally, there is an optional loss intended to align the last hidden states of the teacher and student models via a cosine embedding loss:

$$L_{cos}(X) = 1 - \frac{1}{N} \sum_{n=1}^{N} \phi\left(h_t(X)_n, h_s(X)_n\right)$$

where h_t and h_s represent functions that output the last hidden states of the teacher and student models respectively (each of which is a collection of N D-dimensional vectors, where D is the size of the hidden dimension), and φ is the cosine similarity function.
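A compact PyTorch sketch of the three terms is given below; the tensor shapes, the temperature value, and the padding/masking bookkeeping are simplified compared to a full training loop and should be read as an illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def distil_losses(student_logits, teacher_logits, student_hidden, teacher_hidden,
                  labels, masked, temperature=2.0):
    """Sketch of the three loss terms. `masked` is a boolean tensor that is True
    for masked tokens; logits are (N, |V|) and hidden states are (N, D)."""
    # (1) Hard MLM cross-entropy on the masked positions only.
    mlm = F.cross_entropy(student_logits[masked], labels[masked])

    # (2) Soft-label distillation: KL divergence between teacher and student
    # output distributions on the masked positions.
    soft = F.kl_div(
        F.log_softmax(student_logits[masked] / temperature, dim=-1),
        F.softmax(teacher_logits[masked] / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # (3) Cosine embedding loss aligning the last hidden states of the two models.
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
    return mlm, soft, cos
```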

TinyBioBERT
This model uses a unique distillation method called 'transformer-layer distillation', which is applied to each layer of the student to align the attention maps and the hidden states of the student with those of the teacher.

Architecture
This model is available in two sizes: the first is a 4-layer transformer with a hidden dimension and embedding size of 312 and about 15M parameters; the second is a 6-layer transformer with the same design as DistilBERT (see the DistilBioBERT section above). This model contains around 30.5K words in its vocabulary and employs an uncased tokeniser, which means it does not include upper-cased letters in its vocabulary.

Initialisation of the Student
The weights of this model are initialised randomly, since its hidden and embedding sizes differ from those of its teacher. However, the DistilBERT initialisation scheme could be used when the hidden and embedding sizes of the student are the same as those of the teacher, which to the best of our knowledge was not tried in the original paper.

Transformer-layer distillation
This distillation is applied to the attention maps and outputs of each transformer layer of the student, along with the final output layer and the embedding layer of the student. Since the student is smaller than the teacher, the numbers of layers are not equal. As a result, each layer of the student is mapped to a specific layer of the teacher with which the distillation is performed. The mapping from the student layer index to the corresponding teacher layer index is determined by:

$$T_i = g(i) = i \cdot \frac{N}{M}$$

where i is the index of the student layer, M and N are the numbers of transformer layers in the student and the teacher respectively, g(·) is the mapping function, and T_i is the index of the corresponding transformer layer of the teacher. In both models, g(0) = 0, which is the index of the embedding layer, and g(M + 1) = N + 1, which is the index of the output layer.
The mean squared error loss between each student layer and its corresponding layer in the teacher is calculated as follows:

$$L_{layer}^{l}(X) = \mathrm{MSE}\left(h_s^{l}(X)\, W_h,\; h_t^{g(l)}(X)\right) + \frac{1}{H} \sum_{k=1}^{H} \mathrm{MSE}\left(a_{s,k}^{l}(X),\; a_{t,k}^{g(l)}(X)\right)$$

where h_s^l(X) and h_t^{g(l)}(X) output the hidden states of the l-th layer of the student and the g(l)-th layer of the teacher respectively, and a_s^l(X) and a_t^{g(l)}(X) output the attention maps of the l-th layer of the student and the g(l)-th layer of the teacher, respectively. Because these models use multi-head attention, there are H attention maps per layer, and the mean squared error is applied to each head independently, as shown above. Finally, W_h is a projection matrix used when the hidden dimensions of the student and the teacher are not the same.
In addition to the transformer-layer loss described above, TinyBERT uses two additional losses, one for the embedding layer and one for the student's output probabilities. The embedding loss is designed to align the embedding of the student (E_s) with that of the teacher (E_t), and is only required if the student and teacher do not share the same embedding layer:

$$L_{emb}(X) = \mathrm{MSE}\left(E_s W_e,\; E_t\right)$$

where W_e is a projection matrix analogous to W_h above. TinyBERT employs one additional loss to align the final probability distributions of the teacher and the student, which is a cross-entropy loss over the teacher's soft labels:

$$L_{pred}(X) = -\sum_{n=1}^{N} \sum_{v=1}^{|V|} f_t(X)_{n,v} \log f_s(X)_{n,v}$$

The complete loss function used for TinyBERT distillation is as follows:

$$L_{tiny}(X) = \sum_{l=0}^{M+1} \lambda_l\, L_l(X), \qquad L_l(X) = \begin{cases} L_{emb}(X) & l = 0 \\ L_{layer}^{l}(X) & 1 \le l \le M \\ L_{pred}(X) & l = M + 1 \end{cases}$$

where λ_0 to λ_{M+1} are hyperparameters controlling the importance of each layer. In this work, all lambdas are set to 1.0.
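The following sketch shows how the layer mapping and the per-layer MSE terms could be computed for lists of hidden states and attention maps collected from the two models; it omits the embedding and prediction losses for brevity, and W_h is represented by a simple linear projection passed in by the caller.

```python
import torch.nn.functional as F

def transformer_layer_loss(student_hiddens, teacher_hiddens,
                           student_attns, teacher_attns, proj):
    """Sketch of TinyBERT-style transformer-layer distillation. The lists contain
    one tensor per transformer layer; `proj` plays the role of W_h and maps the
    student hidden size to the teacher hidden size."""
    M, N = len(student_hiddens), len(teacher_hiddens)
    loss = 0.0
    for i in range(1, M + 1):
        g_i = i * N // M                              # layer-mapping function g(i)
        hidden_loss = F.mse_loss(proj(student_hiddens[i - 1]), teacher_hiddens[g_i - 1])
        # Attention tensors are (heads, seq, seq); MSE is averaged over heads.
        attn_loss = F.mse_loss(student_attns[i - 1], teacher_attns[g_i - 1])
        loss = loss + hidden_loss + attn_loss
    return loss
```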

CompactBioBERT
This model has the same overall architecture as DistilBioBERT (described above), with the difference that here we combine the distillation approaches of DistilBERT and TinyBERT. We utilise the same initialisation technique as in DistilBioBERT and apply a layer-to-layer distillation with three major components, namely MLM, layer, and output distillation.
Layer distillation is performed between each student layer and its corresponding teacher layer, following the same layer mapping as above, with the MSE losses substituted with a cosine embedding loss for hidden-state alignment and a KL divergence for attention-map alignment. The layer distillation loss proposed for CompactBioBERT is:

$$L_{layer}^{l}(X) = \mathrm{KL}\left(a_t^{g(l)}(X) \,\|\, a_s^{l}(X)\right) + \left(1 - \phi\left(h_s^{l}(X),\, h_t^{g(l)}(X)\right)\right)$$

The MLM and output distillation losses are the same as those used in DistilBioBERT: MLM distillation corresponds to L_mlm(X, Y) and output distillation to L_softMLM(X), both defined above. Finally, the complete distillation loss used in CompactBioBERT is:

$$L_{compact}(X, Y) = \alpha_1 L_{mlm}(X, Y) + \alpha_2 L_{softMLM}(X) + \alpha_3 \sum_{l=1}^{M} L_{layer}^{l}(X)$$

where α_1, α_2, and α_3 are weighting terms for combining the different losses. In our settings, α_1 = 1.0, α_2 = 5.0, and α_3 = 3.0.
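A per-layer sketch of this combined objective is shown below; attention maps are assumed to already be probability distributions over the sequence dimension, and the α weights default to the values listed above. This is an illustration rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def compact_layer_loss(student_hidden, teacher_hidden, student_attn, teacher_attn):
    """One layer's CompactBioBERT-style loss: cosine alignment of hidden states
    plus KL divergence between attention distributions."""
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    hidden_term = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
    attn_term = F.kl_div(torch.log(student_attn + 1e-12), teacher_attn,
                         reduction="batchmean")
    return hidden_term + attn_term

def compact_total_loss(mlm_loss, soft_mlm_loss, layer_losses,
                       a1=1.0, a2=5.0, a3=3.0):
    """Weighted combination of the MLM, output-distillation and layer losses."""
    return a1 * mlm_loss + a2 * soft_mlm_loss + a3 * sum(layer_losses)
```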

BioMobileBERT
MobileBERT (Sun et al., 2020) is a compact model that uses a unique design comprised of different components to reduce the model's width (hidden size) while maintaining the same depth as BERT large (24 transformer layers). MobileBERT has proved to be competitive on many NLP tasks while also being efficient in terms of both computational and parameter complexity.

Architecture and Initialisation
MobileBERT uses a 128-dimensional embedding layer followed by 1D convolutions to up-project its output to the desired hidden dimension expected by the transformer blocks. For each of these blocks, MobileBERT uses a linear down-projection at the beginning of the transformer block and an up-projection at its end, followed by a residual connection originating from the input of the block before down-projection. Because of these linear projections, MobileBERT can reduce the hidden size and hence the computational cost of the multi-head attention and feed-forward blocks. This model additionally incorporates up to four feed-forward blocks in order to enhance its representation learning capabilities. Thanks to the strategically placed linear projections, a 24-layer MobileBERT (which is used in this work) has around 25M parameters. To the best of our knowledge, MobileBERT is initialised from scratch.
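The simplified module below illustrates the bottleneck idea (down-project, attend and transform in the narrow space, up-project, and add a residual from the wide input); it deliberately omits MobileBERT's stacked feed-forward blocks, embedding convolution, and normalisation details, and the dimensions are placeholders rather than the published configuration.

```python
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Simplified MobileBERT-style bottleneck block (for illustration only)."""
    def __init__(self, wide=512, narrow=128, heads=4):
        super().__init__()
        self.down = nn.Linear(wide, narrow)        # linear down-projection
        self.attn = nn.MultiheadAttention(narrow, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(narrow, narrow * 4), nn.GELU(),
                                 nn.Linear(narrow * 4, narrow))
        self.up = nn.Linear(narrow, wide)          # linear up-projection
        self.norm = nn.LayerNorm(wide)

    def forward(self, x):                          # x: (batch, seq, wide)
        h = self.down(x)
        h = h + self.attn(h, h, h)[0]              # attention in the narrow space
        h = h + self.ffn(h)                        # feed-forward in the narrow space
        return self.norm(x + self.up(h))           # residual from the wide input
```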

Distillation Procedure
MobileBERT uses layer-wise distillation similar to TinyBERT (Jiao et al., 2020). Unlike TinyBERT, where the student's hidden dimension and number of layers may differ from those of the teacher, MobileBERT utilises a purpose-built teacher named IB-BERT, which has the same hidden dimension and number of layers as the student. As a result, mapping each transformer layer in the student to a matching teacher layer is unnecessary.
The loss employed by MobileBERT for layer-wise distillation is:

$$L_{layer}^{l}(X) = \mathrm{MSE}\left(h_s^{l}(X),\, h_t^{l}(X)\right) + \frac{1}{H} \sum_{k=1}^{H} \mathrm{KL}\left(a_{t,k}^{l}(X) \,\|\, a_{s,k}^{l}(X)\right)$$

The overall loss used for the distillation of MobileBERT is as follows:

$$L_{mobile}(X, Y) = \frac{1}{M} \sum_{l=1}^{M} L_{layer}^{l}(X) + \alpha\, L_{mlm}(X, Y) + (1 - \alpha)\, L_{softMLM}(X)$$

where M is the number of transformer layers and α is a hyperparameter in (0, 1).

Experiments and Results
We evaluate our models on three biomedical tasks, namely NER, QA, and RE. For a fair comparison, we fine-tune all of our models using a constant seed. Note that the results obtained in this work are intended for comparison with BioBERT-v1.1 in a similar setting; we are not focusing on reproducing or outperforming the state of the art on any of the datasets, since that is not the objective of this work.
We distil our students solely from BioBERT and also compare our continually learnt models with it. While there are other recent biomedical transformers available in the literature (see the Introduction), BioBERT is the most general (trained on large biomedical corpora for 1M steps) and is widely used as a backbone for building new architectures. Direct comparison with one major model helps us keep the work focused on compression techniques and on assessing their efficiency in preserving information from a well-performing and reliable teacher. These experiments can in the future be expanded to cover other biomedical models.
For biomedical NER we use eight well-known datasets, namely NCBI-disease (Dogan et al., 2014), BC5CDR (disease and chem) (Li et al., 2016), BC4CHEMD (Krallinger et al., 2015), BC2GM (Smith et al., 2008), JNLPBA (Kim et al., 2004), LINNAEUS (Gerner et al., 2010), and Species-800 (Pafilis et al., 2013), which test the biomedical knowledge of our models in different categories such as Disease, Drug/chem, Gene/protein, and Species. All of our models were trained for 5 epochs with a batch size of 16 and a learning rate of 5e-5. In a few cases, a learning rate of 3e-5 and a batch size of 32 were also used. Because our models use word-piece tokenisers, which may split a single word into several sub-word units, we assigned each word's label to all of its sub-words and then fine-tuned our models on the new labels. As shown in Table 1, DistilBioBERT and CompactBioBERT outperformed the other distilled models on all the datasets. Among the continually learned models, BioDistilBERT and BioMobileBERT fared best (Table 2), while TinyBioBERT and BioTinyBERT were the fastest and most efficient models.
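The snippet below sketches the sub-word label propagation described above using a fast tokenizer's word_ids(); the checkpoint name is given only as an example of one of the released models, and the toy words and labels are arbitrary.

```python
from transformers import AutoTokenizer

# Example checkpoint; any of the released compact models could be used instead.
tokenizer = AutoTokenizer.from_pretrained("nlpie/distil-biobert")

def align_labels(words, word_labels):
    """Assign each word's NER label to all of its word-piece sub-tokens;
    special tokens receive -100 so they are ignored by the loss."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels = [-100 if wid is None else word_labels[wid]
              for wid in encoding.word_ids()]
    return encoding, labels

encoding, labels = align_labels(["Aspirin", "reduces", "inflammation"], [1, 0, 0])
```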
For RE we used the GAD (Bravo et al., 2015) and CHEMPROT (Krallinger et al., 2017) datasets and followed the same pre-processing used in Lee et al. (2020). On the GAD dataset, we randomly selected 10% of the data for the test set using a constant seed and used the rest for training. For both datasets, we trained all of our models for 3 epochs with learning rates of 5e-5 or 3e-5 and a batch size of 16. We used the latest version of CHEMPROT, which has 13 different types of relations.
CompactBioBERT achieved the best results in both tasks among the distilled models (Table 3), and similarly, BioDistilBERT outperformed all of our continually trained models in both tasks (Table 4).
For QA, we used the BioASQ 7b dataset (Tsatsaronis et al., 2015) and followed the same pre-processing steps as Lee et al. (2020). All the models were trained with a batch size of 16. For TinyBERT, TinyBioBERT, and BioTinyBERT a learning rate of 5e-5 was used, while for the remaining models this value was set to 3e-5. As seen in Table 5, among our distilled models CompactBioBERT and TinyBioBERT performed best, and among our continually learned models BioMobileBERT and BioDistilBERT outperformed the rest (Table 6).

Discussion
In this study, we investigated two approaches for compressing biomedical language models. The first strategy was to distil a model from a biomedical teacher, and the second was to use MLM pre-training to adapt an already distilled model to the biomedical domain. Due to computational and time constraints, we trained our distilled models for 100k steps and our continually learned models for 200k steps; as a result, directly comparing these two types of models may be unfair. We observed that distilling a compact model from a biomedical teacher increases its capacity to perform well on complex biomedical tasks while decreasing its general language understanding and reasoning. This means that while our distilled models perform exceptionally well on biomedical NER and RE (Tables 1 and 3), they perform comparatively poorly on tasks that require more general knowledge and language understanding, such as biomedical QA (Table 5).
Weaker results on QA (compared to the continually learned models) suggest that by distilling a model from scratch using a biomedical teacher, the model may lose some of its ability to capture complex grammatical and semantic features while becoming more powerful in identifying and understanding biomedical correlations in a given text (as in Table 3). On the other hand, adapting already compact models to the biomedical domain via continual learning seems to preserve general knowledge regarding natural language structure and semantics in the model (Table 6). It should be noted that the distilled models were only trained for 100k steps and this analysis is based on the current results obtained by these models.
Furthermore, despite having nearly half as many parameters, BioMobileBERT outscored BioDistilBERT on QA. As previously stated, MobileBERT employs a unique structure that allows it to be as deep as 24 layers while maintaining fewer than 30M parameters. BioDistilBERT, on the other hand, is only 6 layers deep. Because of this architectural difference, we hypothesise that the increased number of layers in BioMobileBERT allows it to capture more complex grammatical and semantic features, resulting in superior performance on biomedical QA, which requires not only biomedical knowledge but also some general understanding of natural language.
We trained models of varied sizes and architectures, ranging from small models with only 15M parameters to larger models with up to 65M. In our experiments, we found that when fine-tuned with a high learning rate (e.g. 5e-5), our tiny models, TinyBioBERT and BioTinyBERT, perform well on downstream tasks, while our bigger models tend to perform better with a lower learning rate (e.g. 3e-5).
In addition, we found that compact models that have been trained on the PubMed dataset for fewer training steps (e.g. 50k) tend to achieve better results on more general biomedical datasets such as NCBI-disease, which are annotated for disease mentions and concepts, and perform worse on more specialised datasets like BC5CDR-disease and BC5CDR-chem, which include extra domain-specific information (e.g. chemicals and chemical-disease interactions); the reverse is true for models that are trained longer on the PubMed dataset.
TinyBioBERT and BioTinyBERT are the most efficient models in terms of both memory and time complexity (as evidenced by Figure 4). DistilBioBERT, CompactBioBERT, and BioDistilBERT are the second most efficient set of models in terms of time complexity. BioMobileBERT, on the other hand, is the second most efficient model with regard to memory complexity. In conclusion, if efficiency is the most important factor, the tiny models are the most suitable resources to use. In other use cases, we recommend either the distilled models or BioMobileBERT, depending on the relative importance of memory, time, and accuracy.

Conclusion
In this work, we employed a number of compression strategies to develop compact biomedical transformer-based models that proved competitive on a range of biomedical datasets. We introduced six different models ranging from 15M to 65M parameters and evaluated them on three different tasks. We found that competitive performance may be achieved either by pre-training existing compact models on biomedical data or by distilling students from a biomedical teacher. The choice of distillation or pre-training is dependent on the task, since our pre-trained students outperformed their distilled counterparts on some tasks and vice versa. We discovered, however, that distillation from a biomedical teacher is generally more efficient than pre-training when using the same number of training steps. Due to computational and time constraints, we trained all of our distilled models for 100k steps, and for continual learning, we trained models for 200k steps. For future work, we plan to pre-train models for 500k to 1M steps and publicly release the new models. In addition, since CompactBioBERT and DistilBioBERT performed similarly on most of the tasks, we plan to investigate the effect of hyperparameters on training these models in order to determine which distillation technique is more efficient. Some of the compact biomedical models proposed in this study may be used for inference on mobile devices, which we hope will open new avenues for researchers with limited computational resources.

Table 1 :
Test results for the models that were directly distilled from BioBERT-v1.1 on the NER datasets. The * symbol indicates that any direct comparison should take into account the fact that the other models contain over 60M parameters, whereas TinyBioBERT has only 15M.

Table 2 :
NER test results for models that were pre-trained on the PubMed dataset via the MLM objective and continual learning. Note that the models beginning with the prefix 'Bio' are pre-trained, while the rest are baselines.

Table 3 :
Test results of the models that were directly distilled from BioBERT-v1.1 on the RE datasets. The * symbol indicates that any direct comparison between TinyBioBERT and the other models should account for the significant difference in model size (15M vs. 60M+). Scores for GAD are in the binary mode and the metrics reported for CHEMPROT are macro-averaged.

Table 4 :
Test results on the RE datasets for the models that were pre-trained on PubMed via the MLM objective and continual learning. Model names beginning with the prefix 'Bio' are pre-trained and the others are baselines. Scores for GAD are in the binary mode and the metrics reported for CHEMPROT are macro-averaged.

Table 5 :
Test results of the models that were directly distilled from BioBERT-v1.1 on the BioASQ QA dataset. The metrics used for reporting the results are taken from the BioASQ competition and the models were assessed using the same evaluation script. The metrics are as follows: Strict Accuracy (S), Lenient Accuracy (L), and Mean Reciprocal Rank (M).

Table 6 :
BioASQ QA test results for the models that were pre-trained on the PubMed dataset via the MLM objective and continual learning. The metrics used for reporting the results are taken from the BioASQ competition and the models were assessed using the same evaluation script. The metrics are as follows: Strict Accuracy (S), Lenient Accuracy (L), and Mean Reciprocal Rank (M) scores.