Improving large language models for clinical named entity recognition via prompt engineering

Abstract Importance The study highlights the potential of large language models, specifically GPT-3.5 and GPT-4, in processing complex clinical data and extracting meaningful information with minimal training data. By developing and refining prompt-based strategies, we can significantly enhance the models’ performance, making them viable tools for clinical NER tasks and possibly reducing the reliance on extensive annotated datasets. Objectives This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance. Materials and Methods We evaluated these models on 2 clinical NER tasks: (1) to extract medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) to identify nervous system disorder-related adverse events from safety reports in the vaccine adverse event reporting system (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT. Results Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634, 0.804 for MTSamples and 0.301, 0.593 for VAERS. Additional prompt components consistently improved model performance. When all 4 components were used, GPT-3.5 and GPT-4 achieved relaxed F1 socres of 0.794, 0.861 for MTSamples and 0.676, 0.736 for VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 for the MTSamples dataset and 0.802 for the VAERS), it is very promising considering few training samples are needed. Discussion The study’s findings suggest a promising direction in leveraging LLMs for clinical NER tasks. However, while the performance of GPT models improved with task-specific prompts, there's a need for further development and refinement. LLMs like GPT-4 show potential in achieving close performance to state-of-the-art models like BioClinicalBERT, but they still require careful prompt engineering and understanding of task-specific knowledge. The study also underscores the importance of evaluation schemas that accurately reflect the capabilities and performance of LLMs in clinical settings. Conclusion While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances GPT models' feasibility for potential clinical applications.


Introduction
Electronic health records (EHRs) contain a vast quantity of unstructured data, including clinical notes, which can offer valuable insights for patient care and clinical research [1].However, manually extracting pertinent information from clinical notes presents a challenge, as it is labor-intensive and time-consuming.To address these challenges, researchers have developed various natural language processing (NLP) techniques for automating the clinical information extraction process.Clinical named entity recognition (NER) is a critical clinical NLP task focusing on recognizing boundaries of clinical entities (i.e., words/phrases) and determining their semantic categories, such as medical problems, treatment, and tests [2].With the help of advancements in clinical NER, the time and effort required for manual chart review and coding by health professionals can be significantly reduced, thus improving patient care efficiency and accelerating clinical research [3].
Early clinical NER systems are often dependent on predefined lexical resources and syntactic/semantic rules derived from extensive manual analysis of text [4].Over the past decade, machine learning-based approaches have gained popularity in clinical NER research [5].Current popular clinical information extraction systems, such as cTAKES and CLAMP, are hybrid systems that integrate rule-based and machine learning-based techniques [6].Nevertheless, a bottleneck in building machine learning-based clinical NER models is to develop large, annotated corpora, which often require domain experts and take a long time to build.More recently, transformer-based large language models have emerged as the leading method for developing clinical NLP applications.Bidirectional Encoder Representations from Transformers (BERT) is a widely used pre-trained language model that learns contextual representations of free text [7].Utilizing BERT as the foundation, domain-specific language models like BioBERT, PubMedBERT (trained on biomedical literature), and ClinicalBERT (trained on the MIMIC-III dataset) have been further developed [8,9,10].These models have been applied to clinical NER tasks via transfer learning (i.e., fine-tuning the models on clinical NER corpora), and have shown improved performance with fewer annotated samples [8,9,10].
Generative Pre-trained Transformers (GPT) represent another type of large language model capable of generating human-like responses based on textual input.In November 2022, OpenAI unveiled GPT-3.5 [11], a groundbreaking chatbot driven by the GPT-3.5 language model that quickly garnered interest from researchers and technology enthusiasts.As an extension of GPT-3, GPT-3.5 serves as a conversational agent adept at following complex instructions and generating high-quality responses across various scenarios.Besides its conversational skills, GPT-3.5 has exhibited remarkable performance in many other NLP tasks, such as machine translation and question-answering [12], even in the zero-shot or few-shot learning scenarios [13], where the model can be applied to new tasks without any fine-tuning or with fine-tuning using a very small amount of data.On March 18th, 2023, OpenAI released GPT-4, one of the most advanced NLP models at the time, which has demonstrated even greater capabilities and performance improvements over GPT-3.5 [14].
As interest in GPT models continues to surge, numerous studies are currently exploring the wide range of possibilities offered by these large language models.One prominent example of GPT models for medicine is that GPT-3.5 passed the US medical license exam with about 60% accuracy, which has further sparked the potential use of GPT-3.5 and GPT-4 in the medical domain [15].More applications of GPT-3.5 and GPT-4 in healthcare have also been discussed [16,17,18,19,20,21,22,23]. With those motivations, this study aims to investigate the potential of GPT models for clinical NER tasks.
Meanwhile, prompt engineering has emerged as a crucial aspect of utilizing GPT models effectively for various NLP tasks.Prompt engineering involves designing input prompts that guide the model to generate desired outputs, thereby improving its performance on specific tasks [24].Several studies have explored prompt engineering for GPT models in open-domain settings, demonstrating its effectiveness in enhancing the model's performance across a range of tasks [25,26].In the biomedical domain, some work has been done on prompt engineering for GPT models, focusing on tasks such as biomedical question-answering, text classification and NER [27,28,29].However, to the best of our knowledge, no work has been conducted on prompt engineering for GPT models specifically targeting NER tasks in clinical texts.This highlights the need for further investigation into the potential of GPT models and prompt engineering techniques for clinical NER applications.
The contributions of this study are threefold.First, we proposed a prompt framework for clinical NER by incorporating entity definitions, annotation guidelines, and annotated samples, and demonstrated its effectiveness on two NER tasks (e.g., improving the performance of the GPT models by up to 20% and making is more competitive to fine-tuned models such as BioClinicalBERT).Second, we discussed how the recent LLMs such as GPT models will change the development of NER systems in the medical domain.This is important because LLMs shows a great potential for developing generalizable clinical NER systems without substantial annotation efforts.Finally, this study also established a novel benchmark to evaluate the performance of the LLMs, GPT-3.5 and GPT-4, for the task of clinical NER.We leveraged two distinct clinical NER tasks as benchmarks, namely the 2010 i2b2 concept extraction task [30] and the nervous system disorder-related event extraction task [31].All code and datasets are made publicly available to the community.

Task Overview
This study aims to assess the zero-shot capability of GPT-3.5 based ChatGPT (in the following discussion, we will use ChatGPT to refer to GPT-3.5 based ChatGPT) in the clinical NER task, as defined in the 2010 i2b2 challenge [30].We compared the performance of ChatGPT and GPT-3 in a similar zero-shot setting and included a baseline model, BioClinicalBERT, which was trained on the 2010 i2b2 dataset detailed below.The primary workflow of our investigation is depicted in Figure ??.Two different prompts were crafted to identify three types of clinical entities: Medical Problem, Treatment, and Test from clinical text using both ChatGPT and GPT-3.Additionally, we trained a supervised BioClinicalBERT model using an annotated corpus from the 2010 i2b2 challenge, as a baseline.All three models were then evaluated using an annotated corpus consisting of HPI sections from 100 discharge summaries in the MTSamples collection (see next section).
Figure 1: An overview of the study workflow.

Dataset
Two clinical NER datasets were used in our study, including (1) MTSamples, a set of 163 fully synthetic discharge summaries from MTSamples, which was annotated according to the annotation guidelines from the 2010 i2b2 challenge, which aims at extracting Medical Problem, Treatment, and Test [30]; (2) the VAERS corpus, a set of 91 publicly available safety reports in VAERS, aiming at extracting nervous system disorder-related events [31].The MTSamples dataset is fully synthetic, meaning that it has been artificially generated and contains no real patient information.The VAERS dataset, on the other hand, is derived from publicly available post-market safety reports that are anonymized and do not contain personally identifiable information.So, no sensitive data was sent to OpenAI API, making this study free of privacy concerns.Upon consultation, it was determined that our study did not require an IRB approval.The two datasets were split into training, validation, and test subsets.The training and validation subsets served the purpose of fine-tuning the BioClinicalBERT model.Annotated samples in prompts were randomly sampled from the training sets.The training sets were also used for error analysis to optimize our prompt strategies.The test subsets, however, were reserved exclusively for evaluating the final performance and for comparative analysis.A descriptive statistic of entities in these datasets is presented in Table 1.

Models
We fine-tuned NER models using BioClinicalBERT [32], to serve as baselines of traditional supervised learning approaches.We present results for supervised learning on both the MTSamples test set and the VAERS test set.The model weights were initialized using the transformers package, available at huggingface1 [33].The hyperparameters employed during model training included a learning rate of 5e-5, a training batch size of 4, 20 epochs, and a weight de- cay of 0.01 using the AdamW optimizer [34].In addition to fine-tuning NER models using BioClinicalBERT, we also employed a traditional machine learning approach for comparison.We utilized a Conditional Random Field (CRF) model with word features, including Bag-of-word, capitalization of letters in words, and prefixes and suffixes of words [35].
Temperature in a generative language model refers to a parameter that controls the randomness in the model's predictions, typically ranging from 0 (completely deterministic) to 1 or higher (increasingly random and diverse outputs).
The temperature parameter for GPT models was set to 0 to minimize randomness in response generation.A lower temperature value restricts the model's tendency to take creative leaps, thereby ensuring more predictable and consistent outputs.This is crucial in clinical NER tasks where accuracy and reliability of information extraction are paramount.
In our setup, the GPT models were interacted with in a 'user' role.This role simulates a real-world user interaction with the model, where the 'user' inputs prompts and the model generates responses accordingly.This approach reflects a typical use-case scenario for these models in practical applications.All input and output datasets along with prompt variants are included with Jypter notebooks that can interface with the OpenAI API in our GitHub repository.At the time of this study, costs of GPT-3.5 per 1k tokens were approximately $0.03 for input and $0.06 for output.Costs of GPT-4 per 1k tokens were approximately $ 0.001 for input and $ 0.002 for output.Because of privacy issues, notes containing Personal Identifiable Information (PII) could not be used in this experiment and should not be used with the GPT API.

Prompt engineering
For GPT models, we proposed a task-specific prompt including the following components: (1) Baseline prompt with task description and format specification: This component provides the LLMs with basic information about the tasks we are instructing them to perform and in what format the LLMs should output results.We instructed the models to highlight the named entities within an HTML file using <span>tags with a class attribute indicating the entity types.This allows the output from GPT models to be easily converted into a traditional Inside-Outside-Beginning (IOB) format, which allows for a direct comparison of NER performance with findings from existing studies.
(2) Annotation guideline-based prompts: This component contains entity definitions and linguistic rules derived from annotation guidelines.Entity definitions offer comprehensive, unambiguous descriptions of an entity within the context of a given task.They play an instrumental role in steering the LLM toward the precise identification of entities within text documents.We noticed that the model's predictions often differed substantially from the gold standard in terms of grammatical structure.For example, discrepancies may arise concerning what types of phrases to be included (e.g., noun phrases or adjective phrases).To enhance the model's performance, we referred to and incorporated rules in the annotation guidelines to address these issues.
(3) Error analysis-based instructions: In addition to the original annotation guidelines, we also incorporated additional guidelines following error analysis of GPT outputs using the training data.For example, we noticed that GPT models often tend to annotate consultation procedures as test entities.To prevent this, we incorporated a specific rule stating, "Consultation procedures should not be annotated as tests.".
(4) Annotated samples: To further assist the LLMs in understanding the task and generating accurate results, we provided a set of annotated samples to improve its performance in a few-shot learning setting.We randomly selected either 1 or 5 annotated examples (1 or 5-shot learning) from the training set and formatted them according to the task description and entity markup guide.For instance, given a sentence 'He had been diagnosed with osteoarthritis of the knees and had undergone arthroscopy years prior to admission .' with 'osteoarthritis of the knees' and 'arthroscopy' annotated as medical problem and test entities , we incorporated this sentence into the prompt using the following format: ### Examples Example Input: He had been diagnosed with osteoarthritis of the knees and had undergone arthroscopy years prior to admission .Example Output: He had been diagnosed with <span class="problem">osteoarthritis of the knees</span>and had undergone <span class="test">arthroscopy</span>years prior to admission .
We compared the effectiveness of different prompt components by incrementally incorporating annotation guidelinebased prompts, error analysis-based instructions and annotated samples as shown in Table 2 (see the complete prompts for two datasets in supplementary materials S1.1).

Evaluation
The performance of the models was evaluated using Precision (P), Recall (R), and F1 scores, following the same evaluation script in the 2010 i2b2 challenge [30].These scores were computed based on both exact-match and relaxedmatch criteria.In the context of an exact match, an extracted entity should have identical token boundary and entity type as that in the gold standard.For relaxed-match, an extracted entity that exhibits overlap in text and shares the same entity type with the gold standard is acceptable.

Zero-shot performance with different prompts
The performance evaluation of GPT-3.5 and GPT-4 in zero-shot settings using different prompts are detailed in Table 3 and Figure 2. Following the integration of annotation guideline-based prompts and error analysis-based instructions, we noticed an improvement in the performance metrics of both GPT models, across each dataset and under each evaluation criteria.Interestingly, we found these two components to have a more pronounced effect on the performance of GPT-3.5 than on GPT-4.More specifically, GPT-3.5 demonstrated an average increase of 0.09 in overall F1 scores, ranging from 0.04 to 0.14.Conversely, GPT-4 displayed a more restrained average improvement of 0.06, with a range of 0.01 to 0.10.Looking at the dataset-specific effects, these two components had a more substantial impact on the VARES dataset compared to the MTSamples dataset.For VARES, we saw an average increase of approximately 0.11, with a range from 0.09 to 0.14.In contrast, for MTSamples, we saw a more modest approximate average increase of 0.04, with the range extending from 0.01 to 0.08.

Examples
(1) Baseline prompts ### Task Your task is to generate an HTML version of an input text, marking up specific entities related to healthcare.The entities to be identified are: 'medical problems', 'treatments', and 'tests'.Use HTML <span>tags to highlight these entities.Each <span>should have a class attribute indicating the type of the entity.### Entity Markup Guide Use <span class="problem">to denote a medical problem. . .4: 0-, 1-and 5-shot performance of GPT-3.5-turbo-0301 and GPT-4-0314 using all prompt components.

Performance Comparison to Supervised Learning
Table 5 displays the performance of BioClinicalBERT, CRF, GPT-3.5, and GPT-4 models for comparison.Among the three models, BioClinicalBERT still demonstrated the highest performance.For MTSamples, it achieved overall F1 scores of 0.785 and 0.901 under exact-match and relaxed-match respectively.Its performance on the VAERS dataset also remained dominant, with overall F1 scores of 0.668 and 0.802 under exact-match and relaxed-match respectively.
The CRF model achieved an F1 score of 0.584 and 0.525 in MTSamples and VAERS by exact match and surpassing GPT-3.5.In the relaxed-match criteria, the CRF model performed worse than GPT-4 and GPT-3.5 in the MTSamples and had comparable performance to GPT-3.5 in the VAERS dataset.Comparatively, GPT-3.5 lagged on two datasets with the lowest performance, yet still demonstrated a decent performance with scores of 0.794 and 0.676, as evaluated by relaxed-match criteria on the two datasets respectively.GPT-4 showcased highly competitive performance using the relaxed match criteria, accomplishing F1 scores of 0.861 and 0.736 on the MTSamples and VAERS datasets respectively.It is notable, however, that the performances of GPT-3.5 and GPT-4 as evaluated by the exact-match method were not as impressive as those by the relaxed-match.In addition to the test set results, we have provided the model's performance on the validation sets in supplementary materials S1.2 to ensure the BioClinicalBERT is not overfitting.

Error analysis
A random sample of 20 sentences was selected from the outputs generated by each GPT model across the two datasets, post-processing.This selection included sentences with both false positives and false negatives.The error analysis was conducted based on exact match.The error statistics derived from this analysis are presented in Figure 5.When assessed on a dataset basis, GPT-3.5 and GPT-4 exhibited similar error patterns for the MTSamples dataset.Both models encountered challenges when it came to identifying correct entity boundaries.This typically involved making decisions on whether to include article words (such as 'the' in the phrase 'the study drug') or modifiers (such as 'another large' in the phrase 'another large stroke') that precede a noun phrase.In assessing model performance, we considered the exact-match criteria, which may present a different challenge for GPT models compared to BioClinical-BERT.While BioClinicalBERT is fine-tuned specifically on annotated entities with clear boundaries, the GPT models, being large language models, are trained on a broader and more diverse corpus.This distinction could impact their ability to adhere strictly to the exact boundaries of entities as defined in the training data, especially in the context of clinical NER where the linguistic structure and terminology are highly specialized.As for the VAERS dataset, several factors may contribute to its increased complexity.Firstly, inner-annotator agreement was lower compared to the MT-Samples dataset (i.e., average F1 0.7707 [31] vs 0.8620), indicating less consistency in annotations.Additionally, the VAERS dataset contains more semantically specific annotation categories, such as distinguishing between different types of adverse events.This specificity demands a higher level of contextual understanding from the models.On the other hand, GPT-4's major difficulties lie in determining the correct entity boundaries and accurately classifying the entity types.This discrepancy can be attributed to the unique characteristics of each dataset.The VAERS dataset contains more complex entities (i.e., Nervous adverse events vs Other adverse events) compared to the MTSamples dataset, leading to a higher error rate in entity type classification for the models.Another possible reason could be the inconsistency [31] in annotation, which needs further investigation.Our study hints at the as-yet unrealized potential of LLMs in clinical NER tasks by proposing a clinical task-specific prompt framework that incorporates annotation guidelines, error analysis-based instructions, and few-shot examples.We found that the performance of GPT models improved with the task-specific prompts.The best performance achieved by GPT-4 shows a competitive performance as that of BioClinicalBERT in the relaxed-match criteria.LLMs are making paradigm-shifting changes in NLP research and development.Our finding shows a quick and easy path to build more generalizable clinical NER systems by leveraging LLMs.This will significantly change our current practice in clinical NLP.Traditionally, to build a machine learning or deep learning-based NER system for specific types of clinical entities, we have to build an annotated corpus of clinical documents, which is time-consuming and costly, as it often requires medical domain experts.Remarkably, our research shows that LLMs, devoid of further model training or fine-tuning, have exhibited exceptional performance.With merely 1-or 5-shot annotated samples, these models can achieve performance that is close to the fine-tuned models that require hundreds of training samples.This suggests a potential reduction in some of the costs associated with clinical NER system development, particularly in the areas of data annotation.However, it is important to note that this does not eliminate the need for expert input in creating annotation guidelines and in the initial phases of model training.While our study demonstrates that GPT models can achieve competitive performance with fewer annotated examples compared to traditional NLP systems, the role of subject matter experts remains crucial.Experts are needed to write precise annotation guidelines, perform initial annotations for error analysis and example generation, and validate the model's performance.Although the GPT models require fewer annotated instances, the costs associated with expert involvement, API usage, and running an LLM service should not be overlooked.A comprehensive comparison of resource requirements and costs between traditional NLP systems, word embedding models, and LLM-based systems would be valuable for future studies.This will provide a clearer understanding of the practical implications and feasibility of deploying LLMs in clinical NER tasks.Moreover, our approach is generalizable -it shows consistent performance improvements across two different clinical NER tasks.The emergent abilities of LLMs [36] have been further demonstrated in multiple clinical NER tasks here, indicating the feasibility of building one large model for diverse information extraction tasks in the medical domain, which is very appealing.With those changes in mind, an urgent need will be to re-design the workflow for developing clinical NER systems using LLMs.The prompt framework for those two clinical NER tasks is the first step toward this direction and it sheds some lights for several aspects that are worth considering.The first aspect is how to clearly define an information extraction task.Our experiments show specific annotation guidelines are very helpful, which indicates medical knowledge (either in a knowledge base or from human experts) are still critical in LLMs-based NER systems and how to obtain and represent task-specific knowledge in prompts need further investigation.We also demonstrated that supplying annotated examples is effective for performance improvement.Nevertheless, how to select informative and representative samples have not been investigated in this study and other advanced few-shot learning algorithms could be explored.Another important issue is evaluation.In this study, we instructed GPT models to output entities following traditional NER approaches so that we can evaluate them using the previous evaluation scripts.However, we would argue that the current evaluation schema for NER may not be ideal for LLMs-based systems.GPT models, due to their generative nature and extensive pre-training on diverse text corpora, exhibit a nuanced understanding of context and language structure.This enables them to interpret and generate text in a way that sometimes extends beyond the strict boundaries of predefined entity classes.For instance, GPT models often recognized lab tests with abnormal values (e.g., "a blood sugar level of 40" or "white blood cell count of 23,500") as medical problems.While this interpretation is contextually relevant and clinically meaningful, it deviates from the strict entity definitions used in our evaluation, leading to apparent mismatches.Therefore, a better evaluation schema would be needed to assess LLMs performance more accurately.Despite the promising results, our study has some limitations.First, we limited LLMs to GPT models in this study.In future, we will include other popular LLMs such as LLaMA and Falcon [37,38,39].Second, our few-shot learning approaches were relatively simple, and we plan to investigate other approaches such as the chain-of-thoughts method [40,41,42], hoping to yield better results.This is one of the first studies that systematically investigated GPT models for clinical NER via prompt engineering.In this study, we proposed a clinical task-specific prompt framework by incorporating annotation guidelines, error analysis-based instructions, and annotated samples via few-shot learning, and our evaluation on two clinical NER tasks show that the GPT-4 model with our proposed prompts achieved close performance as the state-of-the-art Bio-ClinicalBERT model.The best performance achieved by GPT-4 with 5-shot learning did not work as well as the BioClinicalBERT model on MTSamples and VAERS datasets.Nevertheless, considering almost no training data was used in GPT, its performance is impressive hints the potential of LLMs in clinical NER tasks.While the results demonstrate a promising direction, they also underscore the need for further refinement and development before LLMs can consistently outperform established models like BioClinicalBERT in these specific applications.
Medical Problems are defined as: phrases that contain observations made by patients or clinicians about the patient's body or mind that are thought to be abnormal or caused by a disease. . .### Annotation Guidelines: Only complete noun phrases (NPs) and adjective phrases (APs) should be marked.Terms that fit concept semantic rules, but that are only used as modifiers in a noun phrase should not be marked. . .(3) Error analysis-based instructions ### Error-analysis-based Guidelines: Consultation procedures should not be annotated as tests. . .(4) Annotated samples via few-shot learning ### Examples Example Input1: He had been diagnosed with osteoarthritis of the knees and had undergone arthroscopy years prior to admission .Example Output1: He had been diagnosed with <span class="problem">osteoarthritis of the knees</span>and had undergone <span class="test">arthroscopy</span>years prior to admission. . .

Figure 2 :
Figure 2: Performance comparison using different prompt strategies.

Figure 3 :
Figure 3: Performance comparison based on different numbers of N-shot examples in each prompt design.

Table 1 :
Dataset statistics utilized in this study.

Table 2 :
An Illustration of the prompt framework for clinical NER.

Table 3 :
Performance of BioClinicalBERT and zero-shot performance of ChatGPT and GPT-3 on MTSamples dataset.

Table 4 and
Figure 3 illustrate the performance comparison among different numbers of N-shot examples with all prompt components included.Generally, the inclusion of more examples leads to better model performance.A combination of 5-shot and all prompts produced the best results by GPT-4, achieving F1 0.593 and 0.861 for MTSamples and 0.542, 0.736 for VAERS under exact-and relaxed-match respectively.

Table 5 :
Performance of BioClinicalBERT, CRF, GPT-3.5, and GPT-4 on MTSamples and VAERS datasets.The performance is shown in the order of Precision/Recall/F1.