BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Abstract
Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora.
Results: We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
Availability and implementation: We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.


Introduction
The volume of biomedical literature continues to increase rapidly. On average, more than 3000 new articles are published every day in peer-reviewed journals, excluding pre-prints and technical reports such as clinical trial reports in various archives. PubMed alone has a total of 29M articles as of January 2019. Reports containing valuable information about new discoveries and new insights are continuously added to the already overwhelming amount of literature. Consequently, demand is growing for accurate biomedical text mining tools for extracting information from the literature.
Recent progress in biomedical text mining models was made possible by advances in the deep learning techniques used in natural language processing (NLP). For instance, Long Short-Term Memory (LSTM) and Conditional Random Field (CRF) have greatly improved performance in biomedical named entity recognition (NER) over the last few years (Giorgi and Bader, 2018; Habibi et al., 2017; Wang et al., 2018; Yoon et al., 2019). Other deep learning based models have made improvements in biomedical text mining tasks such as relation extraction (RE) (Bhasuran and Natarajan, 2018; Lim and Kang, 2018) and question answering (QA) (Wiese et al., 2017).
However, directly applying state-of-the-art NLP methodologies to biomedical text mining has limitations. First, as recent word representation models such as Word2Vec (Mikolov et al., 2013), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) are trained and tested mainly on datasets containing general domain texts (e.g. Wikipedia), it is difficult to estimate their performance on datasets containing biomedical texts. Also, the word distributions of general and biomedical corpora are quite different, which can often be a problem for biomedical text mining models. As a result, recent models in biomedical text mining rely largely on adapted versions of word representations (Habibi et al., 2017; Pyysalo et al., 2013).
In this study, we hypothesize that current state-of-the-art word representation models such as BERT need to be trained on biomedical corpora to be effective in biomedical text mining tasks. Previously, Word2Vec, which is one of the most widely known context independent word representation models, was trained on biomedical corpora which contain terms and expressions that are usually not included in a general domain corpus (Pyysalo et al., 2013). While ELMo and BERT have proven the effectiveness of contextualized word representations, they cannot obtain high performance on biomedical corpora because they are pre-trained on only general domain corpora. As BERT achieves very strong results on various NLP tasks while using almost the same structure across the tasks, adapting BERT for the biomedical domain could potentially benefit a wide range of biomedical NLP research.

Approach
In this article, we introduce BioBERT, which is a pre-trained language representation model for the biomedical domain. The overall process of pre-training and fine-tuning BioBERT is illustrated in Figure 1. First, we initialize BioBERT with weights from BERT, which was pretrained on general domain corpora (English Wikipedia and BooksCorpus). Then, BioBERT is pre-trained on biomedical domain corpora (PubMed abstracts and PMC full-text articles). To show the effectiveness of our approach in biomedical text mining, BioBERT is fine-tuned and evaluated on three popular biomedical text mining tasks (NER, RE and QA). We test various pre-training strategies with different combinations and sizes of general domain corpora and biomedical corpora, and analyze the effect of each corpus on pre-training. We also provide in-depth analyses of BERT and BioBERT to show the necessity of our pre-training strategies.
The contributions of our paper are as follows:
• BioBERT is the first domain-specific BERT based model pre-trained on biomedical corpora for 23 days on eight NVIDIA V100 GPUs.
• We show that pre-training BERT on biomedical corpora largely improves its performance. BioBERT obtains higher F1 scores in biomedical NER (0.62) and biomedical RE (2.80), and a higher MRR score (12.24) in biomedical QA than the current state-of-the-art models.
• Compared with most previous biomedical text mining models that are mainly focused on a single task such as NER or QA, our model BioBERT achieves state-of-the-art performance on various biomedical text mining tasks, while requiring only minimal architectural modifications.
• We make our pre-processed datasets, the pre-trained weights of BioBERT and the source code for fine-tuning BioBERT publicly available.

Materials and methods
BioBERT basically has the same structure as BERT. We briefly discuss the recently proposed BERT, and then we describe in detail the pre-training and fine-tuning process of BioBERT.

BERT: bidirectional encoder representations from transformers
Learning word representations from a large amount of unannotated text is a long-established method. While previous models (e.g. Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014)) focused on learning context independent word representations, recent works have focused on learning context dependent word representations. For instance, ELMo (Peters et al., 2018) uses a bidirectional language model, while CoVe (McCann et al., 2017) uses machine translation to embed context information into word representations.
BERT (Devlin et al., 2019) is a contextualized word representation model that is based on a masked language model and pre-trained using bidirectional transformers (Vaswani et al., 2017). Due to the nature of language modeling where future words cannot be seen, previous language models were limited to a combination of two unidirectional language models (i.e. left-to-right and right-to-left). BERT uses a masked language model that predicts randomly masked words in a sequence, and hence can be used for learning bidirectional representations. Also, it obtains state-of-the-art performance on most NLP tasks, while requiring minimal task-specific architectural modification. According to the authors of BERT, incorporating information from bidirectional representations, rather than unidirectional representations, is crucial for representing words in natural language. We hypothesize that such bidirectional representations are also critical in biomedical text mining as complex relationships between biomedical terms often exist in a biomedical corpus (Krallinger et al., 2017). Due to space limitations, we refer readers to Devlin et al. (2019) for a more detailed description of BERT.
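To make the masked language model idea concrete, the following is a minimal sketch of how training inputs could be produced by randomly masking tokens. It is a simplification and not the exact procedure of Devlin et al. (2019): the function name `mask_tokens`, the `[MASK]` replacement for every selected token, and the default masking probability of 0.15 are assumptions made here for illustration and are not specified in this article.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Randomly hide tokens; return the masked sequence and the prediction targets.

    Simplified illustration: every selected token is replaced by mask_token.
    mask_prob=0.15 is an assumed default, not a value taken from this article.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # no prediction target at this position
    return masked, labels

tokens = "a case of oral penicillin anaphylaxis is described".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because a masked position can be predicted from the tokens on both sides of it, this objective allows the encoder to use left and right context at the same time.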

Pre-training BioBERT
As a general purpose language representation model, BERT was pre-trained on English Wikipedia and BooksCorpus. However, biomedical domain texts contain a considerable number of domain-specific proper nouns (e.g. BRCA1, c.248T>C) and terms (e.g. transcriptional, antimicrobial), which are understood mostly by biomedical researchers. As a result, NLP models designed for general purpose language understanding often obtain poor performance in biomedical text mining tasks. In this work, we pre-train BioBERT on PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC). The text corpora used for pre-training of BioBERT are listed in Table 1, and the tested combinations of text corpora are listed in Table 2. For computational efficiency, whenever the Wiki + Books corpora were used for pre-training, we initialized BioBERT with the pre-trained BERT model provided by Devlin et al. (2019). We define BioBERT as a language representation model whose pre-training corpora include biomedical corpora (e.g. BioBERT (+ PubMed)).
For tokenization, BioBERT uses WordPiece tokenization (Wu et al., 2016), which mitigates the out-of-vocabulary issue. With WordPiece tokenization, any new word can be represented by frequent subwords (e.g. Immunoglobulin => I ##mm ##uno ##g ##lo ##bul ##in). We found that using a cased vocabulary (not lowercasing) results in slightly better performance in downstream tasks. Although we could have constructed a new WordPiece vocabulary based on biomedical corpora, we used the original vocabulary of BERT BASE for the following reasons: (i) compatibility of BioBERT with BERT, which allows BERT pre-trained on general domain corpora to be re-used, and makes it easier to interchangeably use existing models based on BERT and BioBERT and (ii) any new words may still be represented and fine-tuned for the biomedical domain using the original WordPiece vocabulary of BERT.
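The subword split in the Immunoglobulin example can be illustrated with a greedy longest-match-first procedure, which is how WordPiece vocabularies are commonly applied at inference time. The sketch below is an illustration under stated assumptions, not the tokenizer shipped with BERT: the function name `wordpiece_tokenize` and the seven-entry `toy_vocab` are hypothetical and chosen only so that the example from the text can be reproduced.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (hypothetical; the real BERT vocabulary is far larger).
toy_vocab = {"I", "##mm", "##uno", "##g", "##lo", "##bul", "##in"}
print(wordpiece_tokenize("Immunoglobulin", toy_vocab))
# ['I', '##mm', '##uno', '##g', '##lo', '##bul', '##in']
```

Since the pieces are looked up case-sensitively, the sketch also reflects the cased vocabulary choice: a lowercased "immunoglobulin" would not match the piece "I" in this toy vocabulary.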

Fine-tuning BioBERT
With minimal architectural modification, BioBERT can be applied to various downstream text mining tasks. We fine-tune BioBERT on the following three representative biomedical text mining tasks: NER, RE and QA.
Named entity recognition is one of the most fundamental biomedical text mining tasks, which involves recognizing numerous domain-specific proper nouns in a biomedical corpus. While most previous works were built upon different combinations of LSTMs and CRFs (Giorgi and Bader, 2018; Habibi et al., 2017; Wang et al., 2018), BERT has a simple architecture based on bidirectional transformers. BERT uses a single output layer based on the representations from its last layer to compute only token level BIO2 probabilities. Note that while previous works in biomedical NER often used word embeddings trained on PubMed or PMC corpora (Habibi et al., 2017; Yoon et al., 2019), BioBERT directly learns WordPiece embeddings during pre-training and fine-tuning. For the evaluation metrics of NER, we used entity level precision, recall and F1 score.
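Entity level scoring differs from token level scoring in that a predicted entity counts only if its whole span and type match a gold entity. The following sketch shows one way to compute entity level precision, recall and F1 from BIO2 tag sequences under exact matching. The helper names `extract_entities` and `entity_prf` and the example tag sequences are hypothetical, and the official evaluation scripts for each dataset may handle edge cases (such as an I- tag without a preceding B- tag) differently.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO2 tag sequence."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a final span
        # Close the open span unless the current tag continues it.
        if start is not None and not (tag.startswith("I-") and tag[2:] == etype):
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities

def entity_prf(gold_tags, pred_tags):
    """Entity level precision, recall and F1 with exact span and type matching."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical example: the second gold entity is missed by the prediction.
gold = ["B-Disease", "I-Disease", "O", "B-Disease"]
pred = ["B-Disease", "I-Disease", "O", "O"]
print(entity_prf(gold, pred))  # precision 1.0, recall 0.5
```

A prediction that covers only part of a gold span contributes a false positive and a false negative at the same time, which is why entity level scores are usually lower than token level scores.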
Relation extraction is a task of classifying relations of named entities in a biomedical corpus. We utilized the sentence classifier of the original version of BERT, which uses a [CLS] token for the classification of relations. Sentence classification is performed using a single output layer based on a [CLS] token representation from BERT. We anonymized target named entities in a sentence using pre-defined tags such as @GENE$ or @DISEASE$. For instance, a sentence with two target entities (gene and disease in this case) is represented as "Serine at position 986 of @GENE$ may be an independent genetic predictor of angiographic @DISEASE$." The precision, recall and F1 scores on the RE task are reported.
Question answering is a task of answering questions posed in natural language given related passages. To fine-tune BioBERT for QA, we used the same BERT architecture used for SQuAD (Rajpurkar et al., 2016). We used the BioASQ factoid datasets because their format is similar to that of SQuAD. Token level probabilities for the start/end location of answer phrases are computed using a single output layer. However, we observed that about 30% of the BioASQ factoid questions were unanswerable in an extractive QA setting as the exact answers did not appear in the given passages. Like Wiese et al. (2017), we excluded the samples with unanswerable questions from the training sets. Also, we used the same pre-training process of Wiese et al. (2017), which uses SQuAD, and it largely improved the performance of both BERT and BioBERT. We used the following evaluation metrics from BioASQ: strict accuracy, lenient accuracy and mean reciprocal rank.
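The three QA metrics can be sketched as follows. The definitions used here are our reading of the usual BioASQ factoid evaluation and are assumptions rather than statements from this article: each system returns a ranked list of candidate answers (up to five), strict accuracy counts a question as correct only if the top-ranked candidate is right, lenient accuracy counts it as correct if any returned candidate is right, and mean reciprocal rank averages 1/rank of the first correct candidate. The function name `bioasq_factoid_metrics`, the case-insensitive exact string matching and the placeholder answers are all hypothetical simplifications.

```python
def bioasq_factoid_metrics(ranked_answers, gold_answers, top_k=5):
    """Return (strict accuracy, lenient accuracy, mean reciprocal rank).

    ranked_answers: one ranked candidate list per question, best first.
    gold_answers: one set of acceptable answer strings per question.
    top_k=5 is an assumed cut-off for the candidate list.
    """
    strict_hits = lenient_hits = 0
    reciprocal_rank_sum = 0.0
    for candidates, gold in zip(ranked_answers, gold_answers):
        gold_normalized = {g.lower() for g in gold}
        # Rank (1-based) of the first correct candidate, or None if there is none.
        first_correct_rank = next(
            (rank for rank, cand in enumerate(candidates[:top_k], start=1)
             if cand.lower() in gold_normalized),
            None,
        )
        if first_correct_rank is not None:
            lenient_hits += 1
            reciprocal_rank_sum += 1.0 / first_correct_rank
            if first_correct_rank == 1:
                strict_hits += 1
    n = len(gold_answers)
    return strict_hits / n, lenient_hits / n, reciprocal_rank_sum / n

# Placeholder data: question 1 is answered at rank 1, question 2 at rank 2.
ranked = [["answer a", "answer b"], ["answer x", "answer y"]]
gold = [{"answer a"}, {"answer y"}]
print(bioasq_factoid_metrics(ranked, gold))  # (0.5, 1.0, 0.75)
```

Under these definitions, strict accuracy never exceeds MRR, and MRR never exceeds lenient accuracy, which gives a quick sanity check on reported scores.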

Datasets
The statistics of biomedical NER datasets are listed in Table 3. We used the pre-processed versions of all the NER datasets provided by Wang et al. (2018) except the 2010 i2b2/VA, JNLPBA and Species-800 datasets. The pre-processed NCBI Disease dataset has fewer annotations than the original dataset due to the removal of duplicate articles from its training set. We used the CoNLL format (https://github.com/spyysalo/standoff2conll) for pre-processing the 2010 i2b2/VA and JNLPBA datasets. The Species-800 dataset was pre-processed and split based on the dataset of Pyysalo (https://github.com/spyysalo/s800). We did not use alternate annotations for the BC2GM dataset, and all NER evaluations are based on entity-level exact matches. Note that although there are several other recently introduced high quality biomedical NER datasets (Mohan and Li, 2019), we use datasets that are frequently used by many biomedical NLP researchers, which makes it much easier to compare our work with theirs.
The RE datasets contain gene-disease relations and protein-chemical relations (Table 4). Pre-processed GAD and EU-ADR datasets are available with our provided code. For the CHEMPROT dataset, we used the same pre-processing procedure described in Lim and Kang (2018).
We used the BioASQ factoid datasets, which can be converted into the same format as the SQuAD dataset (Table 5). We used full abstracts (PMIDs) and related questions and answers provided by the BioASQ organizers. We have made the pre-processed BioASQ datasets publicly available.
For all the datasets, we used the same dataset splits used in previous works (Lim and Kang, 2018; Tsatsaronis et al., 2015; Wang et al., 2018) for a fair evaluation; however, the splits of LINNAEUS and Species-800 could not be found from Giorgi and Bader (2018) and may be different. Like previous work (Bhasuran and Natarajan, 2018), we reported the performance of 10-fold cross-validation on datasets that do not have separate test sets (e.g. GAD, EU-ADR).
We compare BERT and BioBERT with the current state-of-the-art models and report their scores. Note that the state-of-the-art models each have a different architecture and training procedure. For instance, the state-of-the-art model by Yoon et al. (2019) trained on the JNLPBA dataset is based on multiple Bi-LSTM CRF models with character level CNNs, while the state-of-the-art model by Giorgi and Bader (2018) trained on the LINNAEUS dataset uses a Bi-LSTM CRF model with character level LSTMs and is additionally trained on silver-standard datasets. On the other hand, BERT and BioBERT have exactly the same structure, and use only the gold standard datasets and not any additional datasets.
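For datasets without a separate test set, 10-fold cross-validation trains and evaluates the model ten times, each time holding out a different tenth of the data. The sketch below shows only the index bookkeeping. It is a generic illustration and not the fold assignment used in this work or in previous work: the function name `k_fold_indices` is hypothetical, the folds are contiguous, and no shuffling or stratification is applied.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Each sample is held out exactly once across the ten folds.
for fold, (train, test) in enumerate(k_fold_indices(25, k=10)):
    print(fold, len(train), len(test))
```

The reported score is then the average of the per-fold scores, so every sample contributes to evaluation exactly once.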

Experimental setups
We used the BERT BASE model pre-trained on English Wikipedia and BooksCorpus for 1M steps. BioBERT v1.0 (+ PubMed + PMC) is the version of BioBERT (+ PubMed + PMC) trained for 470K steps. When using both the PubMed and PMC corpora, we found that 200K and 270K pre-training steps were optimal for PubMed and PMC, respectively. We also used the ablated versions of BioBERT v1.0, which were pre-trained on only PubMed for 200K steps (BioBERT v1.0 (+ PubMed)) and PMC for 270K steps (BioBERT v1.0 (+ PMC)). After our initial release of BioBERT v1.0, we pre-trained BioBERT on PubMed for 1M steps, and we refer to this version as BioBERT v1.1 (+ PubMed). Other hyper-parameters such as batch size and learning rate scheduling for pre-training BioBERT are the same as those for pre-training BERT unless stated otherwise.
We pre-trained BioBERT using Naver Smart Machine Learning (NSML) (Sung et al., 2017), which is utilized for large-scale experiments that need to be run on several GPUs. We used eight NVIDIA V100 (32GB) GPUs for the pre-training. The maximum sequence length was fixed to 512 and the mini-batch size was set to 192, resulting in 98,304 words per iteration. It takes more than 10 days to pre-train BioBERT v1.0 (+ PubMed + PMC) and nearly 23 days to pre-train BioBERT v1.1 (+ PubMed) in this setting. Despite our best efforts to use BERT LARGE, we used only BERT BASE due to the computational complexity of BERT LARGE.
We used a single NVIDIA Titan Xp (12GB) GPU to fine-tune BioBERT on each task. Note that the fine-tuning process is more computationally efficient than pre-training BioBERT. For fine-tuning, a batch size of 10, 16, 32 or 64 was selected, and a learning rate of 5e-5, 3e-5 or 1e-5 was selected. Fine-tuning BioBERT on QA and RE tasks took less than an hour as the size of the training data is much smaller than that of the training data used by Devlin et al. (2019). On the other hand, it takes more than 20 epochs for BioBERT to reach its highest performance on the NER datasets.
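The per-iteration figure follows directly from the sequence length and the mini-batch size, and the same arithmetic gives a rough upper bound on how many token positions each BioBERT version processes during pre-training. The sketch below is a back-of-the-envelope calculation, assuming every sequence is filled to the maximum length; the variable names are hypothetical, and the positions are WordPiece tokens (possibly including padding) rather than whitespace-separated words.

```python
MAX_SEQ_LEN = 512   # maximum sequence length reported in the text
BATCH_SIZE = 192    # mini-batch size reported in the text

tokens_per_iteration = MAX_SEQ_LEN * BATCH_SIZE
print(tokens_per_iteration)  # 98304

# Pre-training steps reported in the text for each BioBERT version.
PRETRAIN_STEPS = {
    "BioBERT v1.0 (+ PubMed)": 200_000,
    "BioBERT v1.0 (+ PMC)": 270_000,
    "BioBERT v1.0 (+ PubMed + PMC)": 470_000,
    "BioBERT v1.1 (+ PubMed)": 1_000_000,
}

# Upper bound: assumes every position in every sequence holds a real token.
upper_bound_tokens = {
    name: steps * tokens_per_iteration for name, steps in PRETRAIN_STEPS.items()
}
for name, total in upper_bound_tokens.items():
    print(f"{name}: at most {total / 1e9:.1f}B token positions")
```

Comparing these bounds with the corpus sizes in Table 1 indicates roughly how many passes over each corpus a given number of steps corresponds to.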
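The fine-tuning search space described above is small: four batch sizes and three learning rates give twelve configurations per task. The text does not state how the final configuration was chosen, so the sketch below assumes, purely for illustration, that each configuration is scored on a development set and the best one is kept. The names `select_config` and `toy_score` are hypothetical, and `toy_score` is a stand-in for an actual fine-tuning and evaluation run.

```python
from itertools import product

BATCH_SIZES = (10, 16, 32, 64)       # candidate batch sizes from the text
LEARNING_RATES = (5e-5, 3e-5, 1e-5)  # candidate learning rates from the text

def select_config(score_fn):
    """Return the (batch_size, learning_rate) pair with the highest score."""
    grid = list(product(BATCH_SIZES, LEARNING_RATES))  # 4 x 3 = 12 configurations
    return max(grid, key=lambda config: score_fn(*config))

def toy_score(batch_size, learning_rate):
    """Hypothetical stand-in for 'fine-tune, then evaluate on a development set'."""
    return -abs(batch_size - 32) - abs(learning_rate - 3e-5) * 1e5

print(select_config(toy_score))  # (32, 3e-05)
```

With only twelve configurations, an exhaustive search is affordable on a single GPU for the QA and RE tasks, where one fine-tuning run takes less than an hour.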

Experimental results
The results of NER are shown in Table 6. First, we observe that BERT, which was pre-trained on only the general domain corpus, is quite effective, but the micro averaged F1 score of BERT was lower (2.01 lower) than that of the state-of-the-art models. On the other hand, BioBERT achieves higher scores than BERT on all the datasets. BioBERT outperformed the state-of-the-art models on six out of nine datasets, and BioBERT v1.1 (+ PubMed) outperformed the state-of-the-art models by 0.62 in terms of micro averaged F1 score. The relatively low scores on the LINNAEUS dataset can be attributed to the following: (i) the lack of a silver-standard dataset for training previous state-of-the-art models and (ii) different training/test set splits used in previous work (Giorgi and Bader, 2018), which were unavailable.
The RE results of each model are shown in Table 7. BERT achieved better performance than the state-of-the-art model on the CHEMPROT dataset, which demonstrates its effectiveness in RE. On average (micro), BioBERT v1.0 (+ PubMed) obtained a higher F1 score (2.80 higher) than the state-of-the-art models. Also, BioBERT achieved the highest F1 scores on 2 out of 3 biomedical datasets.
The QA results are shown in Table 8. We micro averaged the best scores of the state-of-the-art models from each batch. BERT obtained a higher micro averaged MRR score (7.0 higher) than the state-of-the-art models. All versions of BioBERT significantly outperformed BERT and the state-of-the-art models, and in particular, BioBERT v1.1 (+ PubMed) obtained a strict accuracy of 38.77, a lenient accuracy of 53.81 and a mean reciprocal rank score of 44.77, all of which were micro averaged. On all the biomedical QA datasets, BioBERT achieved new state-of-the-art performance in terms of MRR.

Discussion
We used additional corpora of different sizes for pre-training and investigated their effect on performance. For BioBERT v1.0 (+ PubMed), we set the number of pre-training steps to 200K and varied the size of the PubMed corpus. Figure 2(a) shows that the performance of BioBERT v1.0 (+ PubMed) on three NER datasets (NCBI Disease, BC2GM, BC4CHEMD) changes in relation to the size of the PubMed corpus. Pre-training on 1 billion words is quite effective, and the performance on each dataset mostly improves until 4.5 billion words. We also saved the pre-trained weights from BioBERT v1.0 (+ PubMed) at different pre-training steps to measure how the number of pre-training steps affects its performance on fine-tuning tasks. Figure 2(b) shows the performance changes of BioBERT v1.0 (+ PubMed) on the same three NER datasets in relation to the number of pre-training steps. The results clearly show that the performance on each dataset improves as the number of pre-training steps increases. Finally, Figure 2(c) shows the absolute performance improvements of BioBERT v1.0 (+ PubMed + PMC) over BERT on all 15 datasets. F1 scores were used for NER/RE, and MRR scores were used for QA. BioBERT significantly improves performance on most of the datasets.

Conclusion
In this article, we introduced BioBERT, which is a pre-trained language representation model for biomedical text mining. We showed that pre-training BERT on biomedical corpora is crucial in applying it to the biomedical domain. Requiring minimal task-specific architectural modification, BioBERT outperforms previous models on biomedical text mining tasks such as NER, RE and QA. The pre-released version of BioBERT (January 2019) has already been shown to be very effective in many biomedical text mining tasks such as NER for clinical notes (Alsentzer et al., 2019), human phenotype-gene RE (Sousa et al., 2019) and clinical temporal RE (Lin et al., 2019). The following updated versions of BioBERT will be available to the bioNLP community: (i) BioBERT BASE and BioBERT LARGE trained on only PubMed abstracts without initialization from the existing BERT model and (ii) BioBERT BASE and BioBERT LARGE trained on domain-specific vocabulary based on WordPiece.

Notes (RE results, Table 7): Precision (P), Recall (R) and F1 (F) scores on each dataset are reported. The best scores are in bold, and the second best scores are underlined. The scores on GAD and EU-ADR were obtained from Bhasuran and Natarajan (2018), and the scores on CHEMPROT were obtained from Lim and Kang (2018).

Notes (QA results, Table 8): Strict Accuracy (S), Lenient Accuracy (L) and Mean Reciprocal Rank (M) scores on each dataset are reported. The best scores are in bold, and the second best scores are underlined. The best BioASQ 4b/5b/6b scores were obtained from the BioASQ leaderboard (http://participants-area.bioasq.org).

Prediction samples
BERT: . . . a case of oral penicillin anaphylaxis is described, and the terminology . . .
BioBERT: . . . a case of oral penicillin anaphylaxis is described, and the terminology . . .

BC2GM
BERT: Like the DMA, but unlike all other mammalian class II A genes, the zebrafish gene codes for two cysteine residues . . .
BioBERT: Like the DMA, but unlike all other mammalian class II A genes, the zebrafish gene codes for two cysteine residues . . .

QA
BioASQ 6b-factoid
Q: Which type of urinary incontinence is diagnosed with the Q tip test?
BERT: A total of 25 women affected by clinical stress urinary incontinence (SUI) were enrolled. After undergoing (. . .) Q-tip test, . . .

Q: Which bacteria causes erythrasma?
BERT: Corynebacterium minutissimum is the bacteria that leads to cutaneous eruptions of erythrasma . . .
BioBERT: Corynebacterium minutissimum is the bacteria that leads to cutaneous eruptions of erythrasma . . .

Note: Predicted named entities for NER and predicted answers for QA are in bold.