Consistency enhancement of model prediction on document-level named entity recognition

Abstract
Summary: Biomedical named entity recognition (NER) plays a crucial role in extracting information from documents in biomedical applications. However, many of these applications require NER models to operate at a document level, rather than just a sentence level. This presents a challenge, as the extension from a sentence model to a document model is not always straightforward. Despite the existence of document NER models that are able to make consistent predictions, they still fall short of meeting the expectations of researchers and practitioners in the field. To address this issue, we have undertaken an investigation into the underlying causes of inconsistent predictions. Our research has led us to believe that the use of adjectives and prepositions within entities may be contributing to low label consistency. In this article, we present our method, ConNER, which enhances the label consistency of modifiers such as adjectives and prepositions. By refining the labels of these modifiers, ConNER is able to improve representations of biomedical entities. The effectiveness of our method is demonstrated on four popular biomedical NER datasets. On three datasets, we achieve a higher F1 score than the previous state-of-the-art model. On two datasets, our method yields 7.5%–8.6% absolute improvements in the F1 score. These findings suggest that ConNER is effective on datasets with intrinsically low label consistency. Through qualitative analysis, we demonstrate how our approach helps the NER model generate more consistent predictions.
Availability and implementation: Our code and resources are available at https://github.com/dmis-lab/ConNER/.


A Implementation Details
This appendix describes the hyperparameters of the four biomedical named entity recognition settings. For the common hyperparameters, we search over the batch size, learning rate, and number of training epochs; we also search for the optimal value of the uncertainty threshold Γ. We train ConNER using BioLM (Lewis et al., 2020a) to handle biomedical entity types. This pretrained language model is commonly used as a backbone in the biomedical domain, and we adopt it to generate contextualized representations. We set the maximum sequence length to 128 tokens for the sentence-level context and 512 tokens for the document-level context; longer inputs are truncated. The batch size is set to 32 at the sentence level and 6 at the document level. We select the learning rate from {3e-5, 5e-5} and the number of training epochs from {30, 40, 50}. The full hyperparameter settings are listed in Table 1. We fine-tune our model on a single NVIDIA Titan RTX (24GB) GPU, and training takes less than 2 hours.
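The search space above can be sketched as a small grid. This is a minimal illustration of the values stated in the text; the dictionary layout and the helper name `iter_configs` are our own and not part of the ConNER codebase.

```python
from itertools import product

# Values taken from the text; structure is illustrative only.
SEARCH_SPACE = {
    "max_seq_length": {"sentence": 128, "document": 512},  # tokens; longer inputs truncated
    "batch_size": {"sentence": 32, "document": 6},
    "learning_rate": [3e-5, 5e-5],
    "epochs": [30, 40, 50],
}

def iter_configs(level):
    """Yield one training config per (learning rate, epochs) pair for a context level."""
    for lr, ep in product(SEARCH_SPACE["learning_rate"], SEARCH_SPACE["epochs"]):
        yield {
            "max_seq_length": SEARCH_SPACE["max_seq_length"][level],
            "batch_size": SEARCH_SPACE["batch_size"][level],
            "learning_rate": lr,
            "epochs": ep,
        }

configs = list(iter_configs("document"))  # 2 learning rates x 3 epoch settings = 6 configs
```

Each context level thus requires six fine-tuning runs over the common grid, before the threshold Γ is tuned separately.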

B Breakdown of Error Cases
In Table 2, we break down the error cases of BioLM trained with document context on the NCBI-disease validation set (Dogan et al., 2014). We sample 100 random cases and classify the predictions into false negatives (81%) and false positives (19%). We focus on the largest proportion of errors, namely mispredicted entity boundaries. Many of these cases involve modifiers such as adjectives and prepositions (49% of false negatives and 15% of false positives). Upon examining the attribute functions in our training dataset, we find that the consistency score of entities containing modifiers is generally low. To address this, we propose the ConNER approach to make more consistent predictions. Table 3 shows sample predictions of BioLM (left column) and ConNER (right column) on the NCBI-disease dataset. Each row shows a portion of an entire paragraph and demonstrates how inconsistent predictions are corrected into consistent ones. Most of the entities containing modifiers are short, and a considerable number of the baseline predictions are not consistent.

C Consistency Enhancement on Predictions
Example from the NCBI-disease dataset: "Increased coronary heart disease in Japanese-American men with mutation in the cholesteryl ester ... strongly genetically determined and show a general inverse relationship with coronary heart disease (CHD)."
We can see that through the proposed label refinement process, we can get more consistent predictions in a considerable number of contexts.

D Ablation Studies
D.1 Performance of removing each loss term
Table 4 shows our ablation results on four biomedical NER benchmarks. We evaluate ConNER by removing each of its components: 1) the distillation loss (-L_distill) and 2) the label-refinement loss (-L_label). The experiments show that ConNER is effective on all four benchmarks. In particular, the AnatEM and Gellus datasets show significant improvements, demonstrating that ConNER is effective for datasets with low label consistency on the entities. We also observe that adding each component consistently improves the recall metrics. These observations reflect an advantage of the ConNER approach, whereby it decides which tokens should be refined based on the uncertainty threshold Γ.
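The ablation can be read as toggling terms of a combined objective. Assuming the full objective is a weighted sum of the main NER loss with the distillation and label-refinement terms (the weights and function name here are illustrative, not from the paper), it can be sketched as:

```python
def total_loss(l_main, l_distill, l_label,
               use_distill=True, use_label=True,
               w_distill=1.0, w_label=1.0):
    """Combine loss terms; disabling a flag reproduces the corresponding ablation row.

    use_distill=False corresponds to the -L_distill ablation,
    use_label=False to the -L_label ablation.
    """
    loss = l_main
    if use_distill:
        loss += w_distill * l_distill
    if use_label:
        loss += w_label * l_label
    return loss
```

Under this sketch, the full model trains on all three terms, while each ablation row drops exactly one auxiliary term and keeps the main classification loss.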

D.2 Impact of threshold Γ
We provide an interpretation of the threshold Γ (see Figure 1). We observe that Γ = 0.3 works well on all four benchmarks except the Gellus dataset. When Γ is higher than 0.6 on the NCBI-disease dataset, the F1 score remains unchanged. We attribute this to the biomedical pre-trained language model already predicting biomedical entities stably, in which case the ConNER approach does not need to interfere with the predictions of the main classification layer. In contrast, on the Gellus dataset, Γ = 0.8 performs best, with high precision and recall. The results remain steady when we run five times with different seeds. Nevertheless, we suggest using Γ = 0.3 to achieve stable performance across all benchmarks.
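The role of Γ can be illustrated with a minimal sketch. Here we assume token uncertainty is measured as the normalized entropy of the predicted label distribution; the actual uncertainty measure in ConNER may differ, and the function names are ours.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a label distribution, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def tokens_to_refine(token_probs, gamma=0.3):
    """Indices of tokens whose predictive uncertainty exceeds the threshold Γ."""
    return [i for i, probs in enumerate(token_probs)
            if normalized_entropy(probs) > gamma]

preds = [
    [0.98, 0.01, 0.01],  # confident token: keep the main classifier's prediction
    [0.40, 0.35, 0.25],  # uncertain token: candidate for label refinement
]
```

With a low Γ (e.g., 0.3) more tokens fall above the threshold and are eligible for refinement; with a high Γ (e.g., 0.8) refinement only fires on highly uncertain tokens, which matches the observation that a well-calibrated backbone needs little interference.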