BioLORD-2023: semantic textual representations fusing large language models and clinical knowledge graph insights

Abstract
Objective: In this study, we investigate the potential of large language models (LLMs) to complement biomedical knowledge graphs in the training of semantic models for the biomedical and clinical domains.
Materials and Methods: Drawing on the wealth of the Unified Medical Language System knowledge graph and harnessing cutting-edge LLMs, we propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences, consisting of 3 steps: an improved contrastive learning phase, a novel self-distillation phase, and a weight averaging phase.
Results: Through rigorous evaluations of diverse downstream tasks, we demonstrate consistent and substantial improvements over the previous state of the art for semantic textual similarity (STS), biomedical concept representation (BCR), and clinical named entity linking, across 15+ datasets. Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages and finetuned on 7 European languages.
Discussion: Many clinical pipelines can benefit from our latest models. Our new multilingual model enables a range of languages to benefit from our advancements in biomedical semantic representation learning, opening a new avenue for bioinformatics researchers around the world. As a result, we hope to see BioLORD-2023 becoming a precious tool for future biomedical applications.
Conclusion: In this article, we introduced BioLORD-2023, a state-of-the-art model for STS and BCR designed for the clinical domain.


APPENDIX A: NEGATIVE RESULTS
In this section, we report on alternative designs that we considered while developing our BioLORD pipeline. These results provide some insights into the limitations of our approach and suggest directions for future work.
- We initially hypothesized that generating definitions for the concepts using the knowledge graph and LLMs would replace the need for textual descriptions sampled from the knowledge graph. However, our experiments showed that the textual descriptions still contributed positively to the performance of our pipeline, as they enforced an attraction between the concepts and their parent concept names (which we used in the template). Therefore, we decided to use both the definitions and the textual descriptions in our final pipeline.
- We also investigated the effect of using different biomedical models as base models for our BioLORD models. We expected that these models would have an advantage over general-purpose language models fine-tuned on STS, as they were pre-trained on large-scale biomedical corpora. However, our results did not confirm this expectation, as the biomedical models performed worse than the fine-tuned general-purpose models. We attribute this to a lack of sentence understanding in the biomedical models, which led them to misinterpret definitions and to overfit on shared tokens instead of semantic similarity. Thus, we concluded that base models with STS pre-training are critical for producing BioLORD models.
- Finally, we attempted to include hard-negative mining in our pipeline. We implemented it based on the ontological knowledge graph, by choosing siblings or siblings of ancestors (after ensuring they were not ancestors of the current concept themselves) as hard negatives. While hard mining improved NEL, it degraded semantic similarity; it seems that this strategy was too aggressive and penalized concepts that were not very dissimilar. We think a more sophisticated approach would be needed in this case, with a margin that depends on the two concepts being compared.
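For illustration, sibling-based hard-negative selection along these lines could look as follows. This is a minimal sketch over a toy parent-child ontology; the `parents` table and the function names are hypothetical stand-ins for the UMLS graph structures we actually used:

```python
from collections import defaultdict

# Toy ontology: concept -> set of parent concepts (illustrative data only)
parents = {
    "viral pneumonia": {"pneumonia"},
    "bacterial pneumonia": {"pneumonia"},
    "pneumonia": {"lung disease"},
    "asthma": {"lung disease"},
    "lung disease": {"disease"},
    "flu": {"disease"},
}

def ancestors(concept):
    """All transitive ancestors of a concept."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen

def hard_negatives(concept):
    """Siblings of the concept or of its ancestors, excluding its own ancestors."""
    children = defaultdict(set)
    for child, ps in parents.items():
        for p in ps:
            children[p].add(child)
    anc = ancestors(concept)
    candidates = set()
    for node in {concept} | anc:
        for p in parents.get(node, ()):
            candidates |= children[p]
    # a hard negative must be neither the concept itself nor one of its ancestors
    return candidates - {concept} - anc
```

On the toy graph above, `hard_negatives("viral pneumonia")` yields its direct sibling plus siblings of its ancestors, while the ancestors themselves are filtered out, which matches the exclusion rule described above.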

APPENDIX B: DETAILS ON OUR EXPERIMENTAL SETUPS
In this section, we summarize the details of the experiments described in the results section. These details are mainly aimed at helping other researchers who wish to replicate our experiments, but they might be insightful on their own to understand the scale of the work.

Contrastive phase
We leveraged our recently upgraded hardware and ran our new contrastive phase experiments on an NVIDIA A40 GPU with 48 GB of memory, up from the 32 GB of the V100 GPU used in the BioLORD-2022 experiments. As a result, we increased the batch size from 96 to 128. We let our experiments run for about 9 days, corresponding to one epoch over our combined dataset.
This duration is very close to that of our previous experiments (7 days) and results from a careful balance between the above-mentioned hardware upgrade and our increased demand for FLOPs. We reused the best-performing hyperparameters from our past experiments (AdamW for the optimizer, WarmupLinear for the scheduler, 2e-5 for the learning rate, 5% of the data for the warmup window, 0.01 for the weight decay, 1 for the number of epochs, PyTorch 2.0 AMP for the mixed-precision training).
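The schedule bookkeeping implied by these hyperparameters can be sketched as follows; the corpus size in the example is illustrative, and `training_schedule` is a hypothetical helper, not part of our released code:

```python
import math

def training_schedule(num_pairs, batch_size=128, epochs=1, warmup_fraction=0.05):
    """Derive step counts for a WarmupLinear schedule from the corpus size.

    warmup_fraction=0.05 mirrors the '5% of the data' warmup window above.
    """
    steps_per_epoch = math.ceil(num_pairs / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_fraction)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
        "optimizer": "AdamW",
        "lr": 2e-5,
        "weight_decay": 0.01,
    }
```

For instance, a corpus of one million pairs at batch size 128 gives 7813 steps per epoch, of which the first 390 form the linear warmup window.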
We release the code of our unsupervised contrastive training for easier replication of our results.

Self-distillation phase
We ran our self-distillation experiments on a single NVIDIA 3090 GPU, given the low memory requirements and the relatively short training time required for this operation. Each training run takes about 5 hours (1 hour per epoch) with the default hyperparameter settings (AdamW for the optimizer, WarmupLinear for the scheduler, 2e-5 for the learning rate, 5% of the data for the warmup window, 0.01 for the weight decay, 5 for the number of epochs, PyTorch 1.7.1 AMP for the mixed-precision training). We also release the code of our supervised training to facilitate the replication of our results.

Weight averaging phase
The weight averaging phase implements the Greedy Model Soup strategy described in the original paper [44]. In our experiments, we found that merging 7 closely-related fine-tuned models performed best among the configurations we attempted. We release the resulting averaged weights as our final BioLORD-2023 model.

Cross-Lingual distillation
The training data consists of all UMLS concept names, as well as parallel translation pairs for SNOMED-CT terms in the regional languages it supports (i.e. 1428k pairs in Spanish, 703k in French, 452k in German, 1444k in Dutch, 284k in Danish, and 412k in Swedish). We leave for future work the usage of UMLS synonyms in languages other than English, and the addition of definitions to the distillation procedure.
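The Greedy Model Soup selection [44] used in the weight averaging phase can be sketched as follows. This is a minimal sketch in which toy flat weight vectors and a hypothetical validation `score` function stand in for real model checkpoints and a real evaluation run:

```python
import numpy as np

def greedy_soup(models, score):
    """Greedy Model Soup: visit checkpoints in decreasing order of their
    individual validation score, and keep each one in the running average
    only if adding it does not hurt the averaged model's score."""
    order = sorted(models, key=score, reverse=True)
    soup = [order[0]]
    best = score(np.mean(soup, axis=0))
    for weights in order[1:]:
        candidate = soup + [weights]
        cand_score = score(np.mean(candidate, axis=0))
        if cand_score >= best:
            soup, best = candidate, cand_score
    return np.mean(soup, axis=0)
```

Because candidates are only kept when they do not degrade the validation score, the returned soup is never worse (on the validation set) than the best single checkpoint.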

APPENDIX C: DESCRIPTION OF EVALUATION DATASETS

Clinical Semantic Textual Similarity
MedSTS [46] is a dataset developed for evaluating clinical semantic textual similarity. It contains 1,068 sentence pairs annotated by two medical experts with semantic similarity scores from 0 to 5 (low to high similarity).
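The STS evaluation protocol used throughout this appendix scores a model by correlating the cosine similarities of its sentence embeddings with the gold ratings. A minimal sketch, assuming a hypothetical `embed` function and toy vectors in place of a real BioLORD encoder:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sts_pearson(pairs, gold_scores, embed):
    """Pearson correlation between cosine similarities of embedded
    sentence pairs and the gold similarity ratings."""
    sims = [cosine(embed(s1), embed(s2)) for s1, s2 in pairs]
    return float(np.corrcoef(sims, gold_scores)[0, 1])
```

Pearson correlation is invariant to the scale of the ratings, so the 0-5 gold scale needs no normalization before comparison with cosine similarities.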

MedNLI
[47] is a dataset initially developed for evaluating natural language inference reasoning in the clinical domain. It was curated by doctors tasked with providing three statements (one entailed, one contradicted, and one neutral) grounded in the medical history of a given patient. We report the proportion of hypothesis statements which are more similar to their entailed statement than to their contradicted statement.
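This similarity-based reading of MedNLI can be sketched as follows; the `embed` function and the toy vectors are hypothetical stand-ins for a real sentence encoder:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def entailment_accuracy(triples, embed):
    """Fraction of (premise, entailed, contradicted) triples where the
    premise embeds closer to its entailed statement than to its
    contradicted one."""
    hits = sum(
        cosine(embed(p), embed(e)) > cosine(embed(p), embed(c))
        for p, e, c in triples
    )
    return hits / len(triples)
```

A random embedding would score around 0.5 on this metric, which gives a natural chance baseline for the numbers reported in the results section.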
BIOSSES [48] is a biomedical semantic similarity dataset containing 100 sentence pairs, which focuses on scientific articles in the biomedical domain rather than clinical notes. It is a challenging dataset because of the length of its entries, which often contain several subsentences.
SICK [49] is a dataset which consists of about 10k English sentence pairs, designed to be rich in lexical, syntactic, and semantic phenomena. Pairs have been annotated for relatedness on a 0-5 scale.

STS-B [50] is a dataset which regroups several other general-purpose text similarity datasets (and contains 8628 sentence pairs). It was developed as a public benchmark for the first shared task of SemEval-2017, a workshop focusing on the evaluation of semantic models.

EHR-RelB
[53] is a dataset containing 3630 concept pairs sampled from electronic health records, rated for relatedness by 3 doctors.

UMNSRS
[54] is a pair of datasets, consisting of 725 clinical term pairs whose semantic similarity and relatedness were determined on a continuous scale by 4 clinicians.

MayoSRS
[55] is a dataset formed by 101 clinical term pairs whose relatedness was reported on a 4-point scale by nine medical coders and three physicians.

Biomedical Named Entity Linking
TwiMed [57] provides a comparable corpus of texts from PubMed (abstracts) and Twitter (posts), allowing pharmacovigilance researchers to better understand the similarities and differences between the language used to describe disease and drug-related symptoms on PubMed (TwiMed-PM, clinical domain) and on Twitter (TwiMed-TW, social media domain). Both sets of data contain 1000 samples.
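Embedding-based entity linking, as evaluated on these datasets, reduces to nearest-neighbour search between mention embeddings and concept-name embeddings. A minimal sketch with toy vectors and hypothetical concept identifiers in place of real UMLS concept unique identifiers:

```python
import numpy as np

def link_mentions(mention_vecs, concept_vecs, concept_ids):
    """Link each mention embedding to the concept whose name embedding
    has the highest cosine similarity (nearest-neighbour linking)."""
    m = mention_vecs / np.linalg.norm(mention_vecs, axis=1, keepdims=True)
    c = concept_vecs / np.linalg.norm(concept_vecs, axis=1, keepdims=True)
    best = np.argmax(m @ c.T, axis=1)  # cosine similarity on unit vectors
    return [concept_ids[i] for i in best]
```

Because both matrices are row-normalized first, the matrix product directly yields cosine similarities, and the argmax over each row selects the linked concept.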

APPENDIX D: RESULTS FOR ALL MODELS
In the table below, the scores of two additional models (GatorTron-Base and MedCPT-Query) have been added for completeness. Despite being a much larger model, GatorTron does not outperform BioLORD models on most of the tasks.
We ran our cross-lingual distillation experiments on a single NVIDIA 3090 GPU, given the low memory requirements and the relatively short training time required for this operation. Each training run takes about 40 hours (4 hours per epoch) with the default hyperparameter settings (AdamW for the optimizer, WarmupLinear for the scheduler, 2e-5 for the learning rate, 5% of the data for the warmup window, 0.01 for the weight decay, 10 for the number of epochs, PyTorch 1.7.1 AMP for the mixed-precision training).

SMM4H
[58] is a dataset for Adverse Drug Event (ADE) normalization. It was used in the SMM4H 2020 shared task on ADE normalization, whose aim was to recognize ADE mentions in tweets and normalize them to their preferred term in the MedDRA ontology. The dataset includes 1212 tweets.

PsyTAR
[59] contains patients' expressions of the effectiveness and adverse drug events associated with psychiatric medications, originating from a sample of 891 drug reviews posted by patients on an online healthcare forum.

CADEC
[60] is a corpus of user-generated drug reviews annotated with adverse drug events (ADEs) and their normalization. It contains 1250 posts from a medical forum, which were annotated by a team of medical experts.

Table D1: Performance characteristics of state-of-the-art biomedical models on STS (Pearson correlation),