Hai-Long Trieu, Makoto Miwa, Sophia Ananiadou, BioVAE: a pre-trained latent variable language model for biomedical text mining, Bioinformatics, Volume 38, Issue 3, February 2022, Pages 872–874, https://doi.org/10.1093/bioinformatics/btab702
Abstract
Large-scale pre-trained language models (PLMs) have advanced state-of-the-art (SOTA) performance on various biomedical text mining tasks. The power of such PLMs can be combined with the advantages of deep generative models, and such combinations already exist. However, they are trained only on general-domain text, and biomedical models are still missing. In this work, we describe BioVAE, the first large-scale pre-trained latent variable language model for the biomedical domain, which uses the OPTIMUS framework to train on large volumes of biomedical text. The model shows SOTA performance on several biomedical text mining tasks when compared to existing publicly available biomedical PLMs. In addition, our model can generate more accurate biomedical sentences than the original OPTIMUS output.
Our source code and pre-trained models are freely available: https://github.com/aistairc/BioVAE.
Supplementary data are available at Bioinformatics online.
1 Introduction
Large-scale pre-trained language models (PLMs) (Beltagy et al., 2019; Lee et al., 2020) have shown state-of-the-art (SOTA) performance on various biomedical text mining tasks. These models provide contextualized representations, learned from large volumes of biomedical text, which then can be easily applied to achieve SOTA on downstream tasks such as named entity recognition (NER), relation extraction (REL) and question answering (QA) (Kim et al., 2019; Lin et al., 2019; Nentidis et al., 2019).
Using such large-scale PLMs to train latent variable models based on deep generative models (DGMs) has been shown to improve representation learning (Bowman et al., 2016; Li et al., 2020). A recent framework called OPTIMUS has successfully combined BERT-based PLMs (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) with variational autoencoders (VAEs) (Kingma et al., 2013), a powerful class of DGMs, achieving SOTA in both representation learning and language generation tasks when trained on two million Wikipedia sentences. However, the data distributions of the general and biomedical domains differ, which makes it challenging to apply these models directly to biomedical text mining tasks (Lee et al., 2020). In addition, training such large-scale models on a massive amount of biomedical text is costly (Supplementary Appendix SF).
To leverage the advantages of VAE-based PLMs for biomedical text mining, we release BioVAE, the first large-scale pre-trained latent variable language model for biomedical texts. The model is trained using the OPTIMUS framework on 34 million sentences from PubMed articles. We evaluate our BioVAE model on downstream text mining tasks, i.e. NER, REL and QA, and achieve SOTA on most of the tasks when compared with previous powerful biomedical PLMs, i.e. BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019) and PubMedBERT (Gu et al., 2020). For language generation, BioVAE generates more accurate biomedical sentences than the original OPTIMUS output.
2 Approach
BioVAE: We use the OPTIMUS framework with the same configurations to train a large-scale VAE language model on biomedical data. We initialize the encoder with the biomedical pre-trained SciBERT (Beltagy et al., 2019) and the decoder with the pre-trained GPT-2 (Radford et al., 2019). We illustrate our model in Supplementary Appendix SA.
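The step that links the encoder to the decoder in a VAE language model can be sketched in a few lines. This is a minimal illustration in plain Python, not the BioVAE or OPTIMUS code: the function name `sample_latent` and the toy numbers are hypothetical, and the real encoder (SciBERT) and decoder (GPT-2) are replaced by hand-written inputs. It shows only the reparameterization step, in which the mean and log-variance produced by the encoder are turned into a latent vector z that would condition the decoder.

```python
import math
import random


def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    z = []
    for m, lv in zip(mean, log_var):
        sigma = math.exp(0.5 * lv)  # standard deviation from log-variance
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of the parameters
        z.append(m + sigma * eps)
    return z


# Hypothetical encoder output for one sentence (latent size 4 for brevity)
mean = [0.2, -1.0, 0.5, 0.0]
log_var = [-2.0, -2.0, -2.0, -2.0]
z = sample_latent(mean, log_var, random.Random(0))
print(len(z))  # one value per latent dimension
```

Because the noise is sampled separately from the mean and variance, gradients can flow through both during training, which is what makes this formulation convenient for end-to-end optimization.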
Corpus: We train BioVAE on the latest biomedical abstracts from the PubMed 2021 Baseline (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/). Our data contain 34M sentences (3.35M abstracts).
Settings: We follow the same settings used in OPTIMUS. We set the latent size to 32 and 768, and beta to 0.0 and 0.5. To train with large batch sizes, we use the LAMB optimizer (You et al., 2020). We use 128 GPUs from the AI Bridging Cloud Infrastructure (ABCI, https://abci.ai/); one epoch over the 34M sentences takes 3 days.
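The role of beta can be made concrete with the standard beta-weighted VAE objective: the reconstruction loss plus beta times the KL divergence between the approximate posterior and a standard normal prior. The sketch below assumes a diagonal Gaussian posterior and uses hypothetical function names and toy values; it omits any KL thresholding or annealing schedule that the OPTIMUS training code may apply.

```python
import math


def gaussian_kl(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def beta_vae_loss(reconstruction_nll, mean, log_var, beta):
    """Reconstruction negative log-likelihood plus the beta-weighted KL term."""
    return reconstruction_nll + beta * gaussian_kl(mean, log_var)


# Toy values, not taken from the paper
mean = [0.5, -0.5]
log_var = [0.0, 0.0]
nll = 10.0

for beta in (0.0, 0.5):  # the two beta values listed in the settings
    print(beta, beta_vae_loss(nll, mean, log_var, beta))
```

With beta set to 0.0 the KL term vanishes and the objective reduces to pure reconstruction; a larger beta pulls the posterior toward the prior.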
3 Results
Tasks: The pre-trained BioVAE model is evaluated on three NER tasks, i.e. BC5CDR (Li et al., 2016), JNLPBA (Kim et al., 2004) and NCBI-disease (Doğan et al., 2014); a REL task, i.e. ChemProt (Kringelum et al., 2016); and a QA task, i.e. BioASQ (Nentidis et al., 2019). We follow the same evaluation settings used in Lee et al. (2020) and Beltagy et al. (2019).
Fine-tuning: We follow the same settings used in our baseline SciBERT model (Beltagy et al., 2019). The final BERT vectors from the encoder of our pre-trained BioVAE model are fed into a classification layer. The pre-trained model is fine-tuned for 2–5 epochs with a batch size of 32 and a learning rate of 2e-5, similarly to SciBERT’s tuning parameters. For QA, we follow the BioBERT settings and evaluation scripts.
Results: Table 1 compares BioVAE with the biomedical pre-trained SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2020) and PubMedBERT (Gu et al., 2020) models on the NER, REL and QA tasks. Our baseline is SciBERT, since we use this model to initialize the encoder. BioVAE outperforms SciBERT on all tasks: +0.54 F1 (JNLPBA), +0.17 F1 (BC5CDR), +1.55 F1 (NCBI-disease) and +0.85 F1 (ChemProt); on QA, it gains +3.57 accuracy over BioBERT. BioVAE scores are lower than those of PubMedBERT in REL and QA but higher in the NER tasks; we discuss the reasons in more detail in Supplementary Appendix SC.
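The reported gains can be rechecked directly from the Table 1 scores. The snippet below is a small sanity check, with the list layout and variable names chosen here for illustration; it takes the best BioVAE score per task across the four BioVAE rows and subtracts the named baseline.

```python
# Scores copied from Table 1
# (columns: BC5CDR, NCBI-disease, JNLPBA, ChemProt, BioASQ)
TASKS = ["BC5CDR", "NCBI-disease", "JNLPBA", "ChemProt", "BioASQ"]
SCIBERT = [90.01, 88.57, 77.28, 83.64, 72.14]
BIOBERT = [88.85, 89.36, 77.59, 76.68, 69.29]
BIOVAE_ROWS = [
    [89.85, 88.85, 77.82, 83.68, 72.86],
    [90.10, 88.12, 77.69, 83.05, 72.14],
    [89.69, 89.80, 77.66, 83.54, 72.14],
    [90.18, 90.12, 77.57, 84.49, 72.86],
]

# Best BioVAE score per task across the four configurations
best_biovae = [max(column) for column in zip(*BIOVAE_ROWS)]

gain_over_scibert = {
    task: round(best - base, 2)
    for task, best, base in zip(TASKS, best_biovae, SCIBERT)
}
gain_over_biobert = {
    task: round(best - base, 2)
    for task, best, base in zip(TASKS, best_biovae, BIOBERT)
}

print(gain_over_scibert)
print(gain_over_biobert["BioASQ"])
```

Note that the per-task maxima come from different BioVAE rows, so the gains describe the best configuration for each task rather than a single model.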
Model | NER: BC5CDR | NER: NCBI-disease | NER: JNLPBA | REL: ChemProt | QA: BioASQ |
---|---|---|---|---|---|
PubMedBERT (Gu et al., 2020) | 87.27 | 79.96 | 71.82 | 85.47 | 75.00 |
BioBERT (Lee et al., 2020) | 88.85 | 89.36 | 77.59 | 76.68 | 69.29 |
SciBERT (Beltagy et al., 2019) | 90.01 | 88.57 | 77.28 | 83.64 | 72.14 |
BioVAE () | 89.85 | 88.85 | 77.82 | 83.68 | 72.86 |
BioVAE () | 90.10 | 88.12 | 77.69 | 83.05 | 72.14 |
BioVAE () | 89.69 | 89.80 | 77.66 | 83.54 | 72.14 |
BioVAE () | 90.18 | 90.12 | 77.57 | 84.49 | 72.86 |
Note: The best scores are in bold, and the scores outperforming the SciBERT baseline are underlined. We report macro F1 scores for NER, micro F1 for REL and accuracy for QA (dz: latent size).
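The difference between the macro and micro F1 averages mentioned in the note can be illustrated with a short sketch. The per-class counts below are invented for illustration and do not come from any of the evaluated datasets; the helper names are likewise hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """F1 computed after pooling the counts of all classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# Invented (tp, fp, fn) counts for a frequent and a rare class
counts = {"frequent": (90, 10, 10), "rare": (1, 3, 3)}

print(round(macro_f1(counts), 4))
print(round(micro_f1(counts), 4))
```

Macro averaging weights every class equally, so a poorly recognized rare class lowers the score noticeably, whereas micro averaging is dominated by the frequent class.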
Text generation: Given an input sequence, our model can reconstruct it. Table 2 compares sentences reconstructed by our BioVAE model and by the OPTIMUS model. The table shows that sentences generated by BioVAE are more accurate than the original OPTIMUS output. Further samples are presented in Supplementary Appendix SB.
Input | BioVAE | OPTIMUS |
---|---|---|
Sequence analysis of CDC4/FBXW7 was carried out on gastric carcinoma cell lines and xenografts | Sequence analysis of CDC4/FBXW7 was carried out on gastric cancer cell lines and xenografts | Electrophysiological studies were performed in the brain and cerebrospinal fluid (CSF) |
Perplexity = 1.000 | Perplexity = 1.113 | Perplexity = 3.534 |
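Perplexity, as reported in Table 2, is conventionally the exponential of the average negative log-likelihood per token. The sketch below assumes that definition and uses made-up token probabilities; the exact evaluation procedure is not described in this section, so this illustrates the metric rather than reproducing the reported values.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Made-up probabilities a model might assign to the tokens of one sentence
confident = [1.0, 1.0, 1.0, 1.0]
uncertain = [0.25, 0.25, 0.25, 0.25]

print(perplexity(confident))
print(perplexity(uncertain))
```

Under this definition a perplexity of 1 corresponds to assigning probability 1 to every token, and lower values indicate a closer match to the reference sequence.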
4 Conclusion
We have described BioVAE, the first large-scale pre-trained latent variable language model for the biomedical domain. The model is trained using the OPTIMUS framework on large volumes of biomedical text. We achieve SOTA when evaluating the model on text mining tasks such as NER, REL and QA. Our results provide strong evidence that it will be possible to apply the BioVAE model to further biomedical tasks in the future.
Acknowledgement
The authors thank Khoa N. A. Duong for the invaluable support in implementing and evaluating the models.
Funding
This paper is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This research is also supported by the Alan Turing Institute and BBSRC, Japan Partnering Award, BB/P025684/1.
Conflict of Interest: none declared.