No means ‘No’: a non-improper modeling approach, with embedded speculative context

Abstract
Motivation: Medical data are complex in nature, as terms that appear in records usually appear in different contexts. In this article, we investigate the embeddings of several biomedical models (BioBERT, BioELECTRA and PubMedBERT) for their understanding of ‘negation and speculation context’, and find that these models are unable to differentiate ‘negated context’ from ‘non-negated context’. To measure the models' understanding, we use the cosine similarity scores of negated versus non-negated sentence embedding pairs. To improve these models, we introduce a generic super-tuning approach that enhances the embeddings on ‘negation and speculation context’ using a synthesized dataset.
Results: After super-tuning, the models' embeddings capture negative and speculative contexts much better. Furthermore, we fine-tuned the super-tuned models on various tasks and found that they outperformed the previous models and achieved state-of-the-art results on negation and speculation cue and scope detection tasks on the BioScope abstracts and the Sherlock dataset. We also confirmed that super-tuning incurs only a minimal trade-off in the performance of the model on other tasks such as natural language inference.
Availability and implementation: The source code, data and the models are available at: https://github.com/comprehend/engg-ai-research/tree/uncertainty-super-tuning.


Introduction
Most medical data, such as discharge summaries, pathology reports and radiology reports, are in textual form, and these data are then used for various research and clinical analysis purposes. Artificial intelligence (AI) has become one of the prominent means of conducting such experiments and deriving insights from these clinical data. Detecting negation and speculation in such sensitive records is an unavoidable prerequisite for many information retrieval and extraction tasks and for building intelligent systems, such as a system that provides decisive criteria on whether or not to recruit a patient during cohort selection for a clinical trial. Hence, it is important to have models that understand the context of the terms appearing in medical records.
At present, there are various state-of-the-art (SOTA) models in the biomedical domain [BioBERT (Lee et al., 2019), BioELECTRA (Kanakarajan et al., 2021) and PubMedBERT (Gu et al., 2020)] which have performed very well in various biomedical NLP tasks. We investigated these models' ability to understand negative context by calculating the cosine similarity between non-negated sentences (e.g. NAC had an effect on the half-life of E-selectin or VCAM-1mRNA) and their corresponding negated sentences (e.g. NAC had no effect on the half-life of E-selectin or VCAM-1mRNA). We found that none of these models were able to differentiate between the sentences in a pair, as the cosine similarities between them were very high, approximately 1. This experiment gave us insight into the models' inability to understand negation.
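The probe described above can be sketched in a few lines. This is a minimal illustration, assuming the sentence embeddings have already been extracted from a model; the vectors below are hypothetical toy values, not outputs of BioBERT, BioELECTRA or PubMedBERT.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return cos(u, v) = (u . v) / (||u|| * ||v||)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 4-dimensional vectors standing in for real sentence embeddings
# (hypothetical values chosen for illustration only).
affirmative = [0.90, 0.10, 0.40, 0.20]
negated = [0.88, 0.12, 0.41, 0.19]

# A value near 1 means the two embeddings are almost indistinguishable.
score = cosine_similarity(affirmative, negated)
```

A score close to 1 for a negated/affirmative pair is the symptom reported here: the embedding space does not separate the two meanings.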
To enhance these models' embeddings so that they understand the negation and speculation context in medical data, we propose a methodology of super-tuning the models on a synthesized dataset. To create the synthesized dataset, we start with negated medical sentences from the BioScope full-text corpus (excluding abstracts), a dataset annotated with negation cues, and build a parser class to extract the negated sentences and their corresponding cues. We then manually transform these negated sentences into their affirmative counterparts. To increase the number of data points, we used paraphraser models such as T5 (Raffel et al., 2020) and Pegasus (Zhang et al., 2019) to generate new sentence pairs. In the end, we generated 56 667 sentence-pair data points, each assigned a score in the range −1 to 1.
We then super-tune the model on these approximately 56k data points using cosine similarity loss with a Siamese network structure. For each sentence pair, we pass sentence A and sentence B through our network, which yields semantically meaningful embeddings that can be compared using cosine similarity. This process is known as super-tuning, and it allows our network to recognize whether negation and/or speculation is present in a sentence. We then fine-tune this super-tuned model on different tasks, such as detecting the cue (e.g. in 'This is not a lump.', 'not' is predicted as the cue) and the scope (e.g. in 'This is not a lump.', 'a lump' is predicted as the scope) in a sentence. The major contributions of our work can be summarized as follows:
• Created and published a synthesized dataset that can be used to make any biomedical model understand the negation context.
• Introduced a super-tuning method that can be plugged in before the fine-tuning task to make the embeddings better at identifying the negation and speculation context in a sentence.
• The resultant model achieved SOTA on negation and speculation cue and scope detection tasks on the BioScope abstracts as well as on the Sherlock dataset.

Literature survey
To date, the well-known algorithms and models in the negation and speculation area have focused on detecting cues and their scope in sentences. These algorithms have been developed using various approaches: rule-based, statistical machine learning and deep learning. Chapman et al. (2001) developed NegEx, a rule-based approach that makes use of regular expressions and was designed to determine whether a disease is present or absent in a medical diagnosis report. Another rule-based model, NegFinder, was developed by Mutalik et al. (2001); it consists of a lexical scanner that generates a finite state machine and a parser built on regular expressions. White et al. (2012), a team from the University of Washington, applied their rule-based idea of using regular expressions to the Sherlock dataset and achieved an F1 score of 90. In addition, Peng et al. (2017) developed NegBio, a rule-based approach that uses patterns on universal dependencies to identify the scope of triggers that are indicative of negation or speculation, and achieved an F1 score of 95.9 on the BioScope dataset.
One disadvantage of rule-based approaches is that they do not generalize well to data from different domains and require the rules to be customized for each new corpus or domain. Machine learning algorithms avoid this problem. Morante and Daelemans (2009) used a memory-based learning algorithm (IGTREE) for detecting cues. For scope resolution, they used three classifiers (a memory-based learning algorithm, SVM and CRF) which predicted whether a given word is the beginning of a scope, the end of a scope or neither. This approach achieved SOTA on the negation cue detection task for BioScope abstracts and full papers, with F1 scores of 98.68 and 97.81, respectively. Other machine learning negation models were developed by Agarwal and Yu (2010) and Councill et al. (2010), using popular statistical approaches, namely support vector machines (SVM) and conditional random fields (CRF), respectively.
Various neural network-based models have been developed for negation scope detection. Qian et al. (2016) developed a convolutional neural network (CNN) based architecture that first classifies whether a token is a negation cue and then uses a final CRF layer to output a sequence determining the scope of negation in the input sentence. Fancellu et al. (2017) developed a negation cue detection model in which a dependency tree is passed as input to an LSTM architecture. Chen (2019) used an attention-based deep learning architecture to detect negation and assertion in clinical notes. Transfer-learning approach was used by Khandelwal

Super-tuning dataset preparation
The BioScope corpus (Szarvas et al., 2008) includes three sub-corpora: abstracts of biological papers from the GENIA corpus (Collier et al., 1999), full scientific papers from FlyBase and the BMC Bioinformatics website, and a corpus of clinical radiology records. Its medical and biological texts have been annotated for negation, speculation and their linguistic scope. This was done to allow a comparison between the development of systems for negation/hedge detection and scope resolution. We built a custom parser that extracts the negation data only from BioScope's full scientific papers, excluding the abstracts. Along with this, we manually created a dataset of negated sentences and their corresponding affirmative sentences. As a result, we form sentence pairs, each of which is assigned one of three scores (−1, 0, 1). A score of −1 indicates that the two sentences convey information about the same thing but are contextually opposite, so the cosine similarity between them should be −1. A score of 0 indicates that the two sentences in a pair are completely different in terms of context, so the cosine similarity should be 0. A score of 1 indicates that the sentences in a pair are similar in context, so the cosine similarity between them should be 1.
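The scoring scheme above can be sketched as a small data structure. The class name `ScoredPair` and the example sentences for scores 0 and 1 are hypothetical and chosen for illustration only; they are not taken from the released dataset.

```python
from dataclasses import dataclass

# The three target cosine similarities used as labels.
ALLOWED_SCORES = (-1.0, 0.0, 1.0)


@dataclass(frozen=True)
class ScoredPair:
    sentence_a: str
    sentence_b: str
    score: float

    def __post_init__(self) -> None:
        # Reject anything that is not one of the three labels.
        if self.score not in ALLOWED_SCORES:
            raise ValueError(f"score must be one of {ALLOWED_SCORES}")


pairs = [
    # Same topic, opposite polarity: target similarity -1.
    ScoredPair(
        "NAC had no effect on the half-life of E-selectin.",
        "NAC had an effect on the half-life of E-selectin.",
        -1.0,
    ),
    # Unrelated sentences: target similarity 0.
    ScoredPair(
        "NAC had no effect on the half-life of E-selectin.",
        "This is not a lump.",
        0.0,
    ),
    # Same meaning, different wording: target similarity 1.
    ScoredPair(
        "NAC had no effect on the half-life of E-selectin.",
        "NAC did not affect the half-life of E-selectin.",
        1.0,
    ),
]
```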
Once we have the above data, i.e. sentence pairs with their labels, we pass them to paraphrase models: T5 by Raffel et al. (2020) and Pegasus by Zhang et al. (2019). Paraphrasing is a task that creates, for an input sentence, new sentences that express the same meaning using a different choice of words. The T5 paraphraser is an encoder-decoder transformer pretrained on 750 GB of diverse texts, which uses Google's Universal Sentence Encoder (USE) (Cer et al., 2018) to create an embedding of each sentence. These embeddings are 512-dimensional vectors produced in such a way that related sentences are closer to each other in the vector space than unrelated sentences. The Pegasus paraphraser is also a transformer model, pretrained with a summarization-like objective in which important sentences are removed and masked from the document and the model must recover them in the output. We pass each sentence pair, for all three scores (−1, 0, 1), through the paraphrasers to output four new sentence pairs with the same meaning as the input pair but with a different choice of words. With this method, from 15 296 sentence pairs across the three scores, we were able to generate 56 999 new sentence pairs.
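The augmentation step above can be sketched as follows. This is a minimal illustration under stated assumptions: the function `augment_pairs` and the `toy_paraphraser` stand-in are hypothetical names, and a real run would plug in a T5 or Pegasus paraphraser in place of the stand-in. Pairing the i-th paraphrase of sentence A with the i-th paraphrase of sentence B is also an assumption; the text does not specify how the new pairs are combined.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, float]
# A paraphraser takes a sentence and a count and returns that many rewrites.
Paraphraser = Callable[[str, int], List[str]]


def augment_pairs(
    pairs: List[Pair], paraphrase: Paraphraser, n_variants: int = 4
) -> List[Pair]:
    """Create n_variants new pairs per input pair, keeping the original score."""
    augmented: List[Pair] = []
    for sent_a, sent_b, score in pairs:
        variants_a = paraphrase(sent_a, n_variants)
        variants_b = paraphrase(sent_b, n_variants)
        # The label is inherited, which assumes the paraphraser preserves
        # polarity (it must not drop or introduce a negation).
        for new_a, new_b in zip(variants_a, variants_b):
            augmented.append((new_a, new_b, score))
    return augmented


def toy_paraphraser(sentence: str, n: int) -> List[str]:
    """Placeholder that only tags the sentence; stands in for a real model."""
    return [f"{sentence} (variant {i + 1})" for i in range(n)]


seed = [("NAC had no effect.", "NAC had an effect.", -1.0)]
new_pairs = augment_pairs(seed, toy_paraphraser)
```

In practice the generated pairs would likely need filtering (for duplicates or for paraphrases that change polarity), which may be why the reported total is not an exact multiple of the seed count.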

Super-tuning strategy
The dataset that we developed consists of sentence pairs with their respective scores (−1, 0, 1), which indicate the target cosine similarity between the sentences of a pair. We used the Sentence-BERT architecture implemented by Reimers and Gurevych (2019), trained with cosine similarity loss. It is a Siamese and triplet network structure that updates the weights so that the produced sentence embeddings are semantically meaningful and can be compared using cosine similarity. The model adds a pooling operation to the embedding output of the existing model, as shown in Figure 3.
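The pooling operation mentioned above can be illustrated with mean pooling, which turns per-token vectors into one fixed-size sentence vector. Mean pooling is an assumption here (the text does not name the pooling strategy), and the function name `mean_pool` is hypothetical.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask value is required per token")
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)  # [2.0, 3.0]
```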
The cosine similarity loss function takes two sentence embeddings (u, v) and returns their similarity score between −1 and 1, as depicted in Figure 3. Equation 1 states the formula for the calculation of the cosine similarity between the embeddings u and v.
The mean squared error loss is computed between the original label and the predicted cosine similarity, as shown in Equation 2, where Y_i represents the given cosine similarity score for the embedding pair u, v. The model uses this loss function to update its weights and minimize the loss.
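In their standard forms, which we assume are what Equations 1 and 2 denote, the cosine similarity and the mean squared error objective can be written as below. The symbols d (embedding dimension) and n (number of sentence pairs) are introduced here for illustration.

```latex
% Equation 1 (assumed standard form): cosine similarity of embeddings u and v
\begin{equation}
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
           = \frac{\sum_{k=1}^{d} u_k v_k}
                  {\sqrt{\sum_{k=1}^{d} u_k^{2}} \, \sqrt{\sum_{k=1}^{d} v_k^{2}}}
\end{equation}

% Equation 2 (assumed standard form): mean squared error against the label Y_i
\begin{equation}
\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \bigl( Y_i - \cos(u_i, v_i) \bigr)^{2}
\end{equation}
```

With labels Y_i in {−1, 0, 1}, minimizing this loss pushes negated/affirmative pairs toward opposite directions and paraphrase pairs toward the same direction.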
We incorporated different biomedical model embeddings, namely BioBERT, PubMedBERT and BioELECTRA, by replacing BERT at the input embedding layer with each of these three models' embeddings in successive rounds of the experiment, as shown in Figure 3. BioBERT, PubMedBERT and BioELECTRA are pre-trained language models for the biomedical domain. The BioBERT model has been pre-trained on PubMed data for 1M steps. PubMedBERT has also been pre-trained on PubMed data, but it differs from BioBERT in the initialization of its weights. BioBERT is initialized using BERT weights that were pre-trained on general-domain corpora, whereas PubMedBERT has been pre-trained from scratch on purely domain-specific data. Both of these models are based on the BERT architecture, a transformer-based model using a multilayer, multi-head self-attention mechanism. The BioELECTRA model is based on the ELECTRA architecture, which comprises a generator and a discriminator. It has also been trained on PubMed data from scratch. The results of incorporating these three models in the described SBERT architecture, which we name NegBioBERT, NegPubMedBERT and NegBioELECTRA, are stated in Section 4.1.

Fine-tuning tasks
We evaluated all three of our models, NegBioBERT, NegPubMedBERT and NegBioELECTRA, on two tasks: natural language inference (MedNLI) and sentence similarity (BIOSSES). The MedNLI (Shivade, 2019) dataset consists of sentence pairs developed by physicians and annotated as definitely true, maybe true or definitely false. The dataset contains 11 232 training, 1395 development and 1422 test instances. BIOSSES (Soğancıoğlu et al., 2017) is a benchmark dataset for biomedical sentence similarity estimation. The dataset comprises 100 sentence pairs, in which each sentence was selected from the Text Analysis Conference (TAC) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. The sentence pairs were evaluated by five different human experts, who judged their similarity and gave scores on a scale of 0 to 4.
In our initial super-tuning results, NegBioELECTRA outperformed the other two models; hence, we used it for negation and speculation cue and scope detection on the BioScope abstracts and Sherlock datasets. We also ensure that there is no data leakage, as none of the data used in super-tuning are reused in any fine-tuning task: since BioScope's full papers are used for super-tuning, we used only the abstracts for fine-tuning.
For negation and speculation cue and scope detection on the BioScope and Sherlock (Mirsky, 2016) datasets, we use a transfer learning approach similar to NegBERT, introduced by Khandelwal and Sawant (2020). Our approach differs from NegBERT in how the labels for the cue detection model are defined. Unlike NegBERT, we provide separate classes for negation and speculation cues, so our model is capable of determining whether a detected cue is negative or speculative. In our token classification approach, we assign the following labels to each word/sub-word for the BioScope dataset:
0 - Padding
1 - Normal word
2 - Negation cue
3 - Speculation cue
And for the Sherlock dataset the labels are as follows:
0 - Padding
1 - Normal word
2 - Cue
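The BioScope label scheme above can be sketched as follows. This is a word-level illustration only: the helper `label_tokens` is a hypothetical name, and a real pipeline would work on sub-word tokens produced by the model's tokenizer.

```python
from typing import Dict, List

# Label ids for the BioScope cue detection scheme.
PAD, NORMAL, NEG_CUE, SPEC_CUE = 0, 1, 2, 3


def label_tokens(tokens: List[str], cues: Dict[int, int], max_len: int) -> List[int]:
    """Label each token, then pad with PAD up to max_len.

    `cues` maps a token index to NEG_CUE or SPEC_CUE; every other
    token is labelled NORMAL.
    """
    if len(tokens) > max_len:
        raise ValueError("sentence is longer than max_len")
    labels = [cues.get(i, NORMAL) for i in range(len(tokens))]
    return labels + [PAD] * (max_len - len(tokens))


tokens = ["This", "is", "not", "a", "lump", "."]
labels = label_tokens(tokens, {2: NEG_CUE}, max_len=8)
# labels == [1, 1, 2, 1, 1, 1, 0, 0]
```

For the Sherlock scheme, the same helper would apply with a single cue label (2) instead of separate negation and speculation labels.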
We extract the negation and speculation sentences, together with their cues and scopes, using the parser mentioned in Section 3.1. We label each token in every sentence using the above-mentioned labels and create the dataset to be fed to the cue detection model. We use our NegBioELECTRA model's tokenizer for tokenization and add padding tokens so that the lengths of the input and output match. These data are then passed to the model, which provides per-token label probabilities. We do the required post-processing to get one label per token or word of the sentence. The whole flow of the cue detection model is depicted in the following example:
[1,1,3,2,1,1,1,1,0,0,0
For scope resolution, we use binary labels, i.e. 1 and 0, to mark whether each token of a sentence is in scope or not, respectively. We also annotate the negation and speculation cues in a sentence by adding [NEG] or [SPE], respectively, before and after the cue. These two special tokens tell the model that a cue is present in the sentence and where it is located. If there are multiple cues in a sentence, we create one annotated instance of the sentence per cue. This process is illustrated with cue-annotated sentences and their per-word labels under the binary labelling scheme.
After data preparation, we use a method similar to cue detection: we use our NegBioELECTRA model's tokenizer for tokenization and add padding tokens so that the lengths of the input and output match. These data are then passed to the model, which provides per-token label probabilities. We do the required post-processing to get one label per token or word of the sentence. The whole architecture is shown in Figure 4.
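The scope-resolution data preparation described above can be sketched as follows, at word level and for single-token cues only. The helper names are hypothetical, and assigning label 0 to the inserted [NEG]/[SPE] marker tokens is an assumption; the text does not state how the markers themselves are labelled.

```python
from typing import List, Tuple

CUE_MARKERS = {"negation": "[NEG]", "speculation": "[SPE]"}

# One cue: (token index of the cue, cue type, token indices in its scope).
Cue = Tuple[int, str, List[int]]


def annotate_cue(tokens: List[str], cue_index: int, cue_type: str) -> List[str]:
    """Wrap the cue token with its special marker, before and after."""
    marker = CUE_MARKERS[cue_type]
    return (
        tokens[:cue_index]
        + [marker, tokens[cue_index], marker]
        + tokens[cue_index + 1:]
    )


def scope_instances(tokens: List[str], cues: List[Cue]) -> List[Tuple[List[str], List[int]]]:
    """Build one (annotated tokens, binary labels) instance per cue."""
    instances = []
    for cue_index, cue_type, scope in cues:
        in_scope = set(scope)
        word_labels = [1 if i in in_scope else 0 for i in range(len(tokens))]
        # Keep labels aligned with the annotated sentence; markers get 0.
        labels = (
            word_labels[:cue_index]
            + [0, word_labels[cue_index], 0]
            + word_labels[cue_index + 1:]
        )
        instances.append((annotate_cue(tokens, cue_index, cue_type), labels))
    return instances


tokens = ["This", "is", "not", "a", "lump", "."]
instances = scope_instances(tokens, [(2, "negation", [3, 4])])
annotated, scope_labels = instances[0]
# annotated    == ["This", "is", "[NEG]", "not", "[NEG]", "a", "lump", "."]
# scope_labels == [0, 0, 0, 0, 0, 1, 1, 0]
```

A sentence with two cues would yield two instances, each annotating only one of the cues.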
Fig. 4. A descriptive diagram illustrating the flow of super-tuning and using the model for negation, speculation cue, and scope detection tasks on the BioScope and the Sherlock datasets. The super-tuned model acts as the base model for both cue and scope detection tasks.

Experiments and results

Experimental setup
We evaluated different biomedical model embeddings, namely BioBERT, PubMedBERT and BioELECTRA, on their understanding of negation context. For this evaluation, we took sentence pairs in which one sentence is negated and the other is its affirmative counterpart. We then obtained their embeddings from each of the three models and evaluated the cosine similarity between them. We found that the cosine similarity of each sentence pair was approximately 0.99 for all of these models (depicted in Fig. 5), which indicates that the model embeddings failed to capture the negation context. Next, we replace the input embedding layer in the SBERT architecture with the BioBERT, PubMedBERT and BioELECTRA embedding layers, one at a time, as depicted in Figure 4. We feed in our training data, prepared as explained in Section 3.1, and train each model with its embedding layer for 10 epochs. Once each of these models is super-tuned, we check which resultant model performs best in terms of cosine similarity score. The models are fine-tuned on the MedNLI and BIOSSES datasets with different learning rates [1e-5, 1.5e-4, 2e-5, 2.5e-4, 3e-5, 5e-5], batch sizes [16, 32] and numbers of epochs to check whether our approach had any effect on the performance of the original models. The best model is fine-tuned on the negation and speculation cue and scope detection tasks on the BioScope abstracts and Sherlock datasets, as described in Section 3.3, with different learning rates [1e-5, 1.5e-4, 2e-5, 2.5e-4, 3e-5, 4e-5, 5e-5], batch sizes [16, 32] and numbers of epochs. We ensured that the train, test and validation data splits followed the standards of the models that achieved the previous SOTA.
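The search over learning rates and batch sizes for the MedNLI and BIOSSES runs can be enumerated as below. This is only a sketch of the grid itself: `build_grid` is a hypothetical helper, the epoch values are left out because they are not given in the text, and the training call is omitted.

```python
from itertools import product
from typing import Dict, List, Sequence

# Values as listed for the MedNLI and BIOSSES fine-tuning runs.
LEARNING_RATES = [1e-5, 1.5e-4, 2e-5, 2.5e-4, 3e-5, 5e-5]
BATCH_SIZES = [16, 32]


def build_grid(
    learning_rates: Sequence[float], batch_sizes: Sequence[int]
) -> List[Dict[str, float]]:
    """Return every (learning rate, batch size) combination as a config dict."""
    return [
        {"learning_rate": lr, "batch_size": bs}
        for lr, bs in product(learning_rates, batch_sizes)
    ]


grid = build_grid(LEARNING_RATES, BATCH_SIZES)
# 6 learning rates x 2 batch sizes = 12 configurations to fine-tune and compare.
```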

Results
We retrieve the new sentence embeddings from each of the models, which we name NegBioBERT, NegPubMedBERT and NegBioELECTRA, for some negated sentences and their affirmative counterparts. We then calculate the cosine similarity between the embeddings of each sentence pair from the three models and compare them. We found that our super-tuning approach successfully made these embeddings capture the negation context: the embeddings of the two sentences in each pair are now further apart for all three models, and their cosine similarity is no longer approximately 0.99 as it was earlier (Figure 5). This comparison can be clearly seen in Figure 7, where we plot the sentence pairs as data points on the X-axis and the cosine similarity on the Y-axis. A similar improvement can be seen in terms of Euclidean distance by comparing Figure 8 with Figure 6, where we plot the sentence pairs as data points on the X-axis and the Euclidean distance between each pair on the Y-axis. We also found that the super-tuning approach had a minimal trade-off in accuracy with respect to the original model on the test data of the MedNLI and BIOSSES datasets.
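The Euclidean comparison described above can be sketched as follows. The vectors are hypothetical toy values chosen only to show the direction of the change; they are not embeddings from any of the models.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the straight-line distance between two embeddings."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy stand-ins for a negated/affirmative pair before and after super-tuning.
before = euclidean_distance([0.90, 0.10], [0.88, 0.12])
after = euclidean_distance([0.90, 0.10], [-0.70, 0.40])
# A larger distance after super-tuning means the pair is better separated.
```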
We used NegBioELECTRA for the negation and speculation cue and scope detection tasks on the BioScope and Sherlock datasets. On the test data, we found that NegBioELECTRA outperformed all the current models in the respective tasks. Across all the tasks mentioned in Figure 9, we observed a mean gain of 1.35. For negation and speculation cue detection on the BioScope abstracts, we achieved an F1 score of 99.02, beating the previous SOTA score by 0.34 points. For negation scope resolution on the BioScope abstracts, our model achieved an F1 score of 98.94, beating the previous SOTA by 3.2 points. The model also gained 0.5 points over the earlier SOTA score in speculation scope resolution on the BioScope abstracts, with an F1 score of 98.37. The model showed similar progress in the Sherlock cue detection task, with an F1 score of 99.56, beating the previous SOTA by 6 points. Similar progress was seen in the Sherlock scope resolution task, where the model gained 4.9 points over the previous SOTA with an F1 score of 97.26.

Conclusion
We release the super-tuned BioELECTRA, i.e. the NegBioELECTRA model, which for any given sentence predicts the negation or speculation cue and then generates the scope for that particular cue. To carry out the described super-tuning approach, we synthetically produced a dataset using the T5 and Pegasus paraphrasers on a manually curated BioScope dataset. Our results show that this super-tuning approach has enriched the existing model's capabilities in negation and speculation cue and scope detection, as well as its embeddings' understanding of the negation and speculation context. We have achieved SOTA on the following tasks: negation and speculation cue and scope detection on the BioScope abstracts, and negation cue and scope detection on the Sherlock dataset. Unlike its predecessors, NegBioELECTRA is able to detect a cue and identify it as negative or speculative. The strong results of the super-tuned biomedical models on general-domain datasets arise because the same cues are present in both domain-specific and general-domain datasets. The cues that the models have been trained on for the respective tasks are therefore limited. Hence, we feel that more datasets with new cues are needed to drive further advances in the negation and speculation area of AI. Along with this, we encourage research to focus more on curating domain-specific as well as general-domain datasets for the super-tuning approach. This will help models to be aware of speculation in every domain. We feel that every emerging natural language model should have the ability to understand negated and speculative sentences. Thus, we need more such super-tuning approaches, which enhance a model's capability to understand the negation and speculation context with a minimal trade-off in its performance on other tasks. A sentence can be not only negative or speculative but also declarative, exclamatory, imperative, interrogative, sarcastic or opinionated. Hence, we are greatly interested in developing a generic model that can understand all of these forms of sentences.