NEREL-BIO: a dataset of biomedical abstracts annotated with nested named entities

Abstract
Motivation: This article describes NEREL-BIO, an annotation scheme and corpus of PubMed abstracts in Russian, with a smaller number of abstracts in English. NEREL-BIO extends the general-domain dataset NEREL by introducing domain-specific entity types. The NEREL-BIO annotation scheme covers both the general and biomedical domains, making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect.
Results: NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO has the following specific features: it annotates nested named entities, and it can be used as a benchmark for cross-domain (NEREL → NEREL-BIO) and cross-language (English → Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension models and report their results.
Availability and implementation: The dataset and annotation guidelines are freely available at https://github.com/nerel-ds/NEREL-BIO.


Introduction
The lack of richly annotated training datasets is a well-known challenge in developing biomedical entity extraction systems. The majority of existing datasets and named entity recognition (NER) methods have been designed for capturing flat (non-nesting) mention structures over coarse entity type schemes. Moreover, the annotated entities in these corpora are limited to the most common entity types such as drugs/chemicals and diseases (Leaman et al., 2009; Gurulingappa et al., 2010; Van Mulligen et al., 2012; Wei et al., 2016).
GENIA (Kim et al., 2003) is a widely studied corpus for biomedical named entity recognition in English, consisting of 2,000 PubMed abstracts with more than 400,000 words and almost 100,000 annotations for biological terms. The annotated abstracts are devoted primarily to biological reactions concerning transcription factors in human blood cells. In total, 47 entity types organized in a taxonomy were annotated. The annotation includes nested and fragmented (non-continuous) entities. Yet, only 17% of the entities in the GENIA corpus are nested within another entity (Katiyar and Cardie, 2018). Mohan and Li (2018) describe the MedMentions corpus of 4,392 PubMed abstracts, which is annotated with 21 entity types, including disorders, anatomical structures, chemicals, and also some general concepts such as organizations, population groups, etc. However, its authors chose to annotate the most specific concept in texts without any overlaps in mentions. Recent work has shown an increase in interest in nested entity structure on general-domain data in various languages, including English (Ringland et al., 2019), Russian (Loukachevitch et al., 2021), Thai (Buaphet et al., 2022), and Danish (Plank et al., 2020). Most research so far, including for Russian, has focused on newswire data. As for the biomedical and clinical domain in Russian, there are several datasets of clinical texts or drug-related user reviews with flat entities (Tutubalina et al., 2021; Nesterov et al., 2022). A recent work on a Russian medical language understanding benchmark (Blinov et al., 2022) includes the RuDReC corpus (Tutubalina et al., 2021) for NER. However, these corpora ignore nested entities, such as "pain in the head" being both a disease and an anatomy entity. To encourage the development of state-of-the-art information extraction systems aimed at providing more comprehensive coverage of biomedical concepts, we decided to construct a large nested named entity dataset, NEREL-BIO, over Russian PubMed abstracts. All entity mentions, including nested structures with up to six layers of depth, are manually annotated.
Fig. 1 presents an example of nested named entities in NEREL-BIO. It discusses "isolated bronchus resection for central cancer" and provides the results of surgical treatment in these specific conditions. The entities "bronchus", "bronchus resection", and "resection" are included in the Unified Medical Language System (UMLS) (Bodenreider, 2004), while "isolated bronchus resection" and "central cancer" are not. Nested entity annotations create a basis for establishing relations between correct (longer) entities, as well as for linking internal entities to equivalent UMLS concepts.
The main contributions of our work are summarized as follows:
1. We present NEREL-BIO, a new biomedical dataset for nested NER in Russian with a smaller corpus in English.
2. We evaluate BERT-based Machine Reading Comprehension (MRC) and sequence models (Li et al., 2020; Shibuya and Hovy, 2020) for biomedical nested NER.
3. To promote further cross-lingual research, we annotate a subset of 100+ English abstracts in translation from Russian using the same annotation scheme.

Data collection and annotation
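NEREL-BIO extends the annotation scheme of the general-domain Russian dataset NEREL (Loukachevitch et al., 2021). Since NEREL-BIO includes all entity types and relation types from NEREL, we provide a short description of the 29 entity types from NEREL in Sec. 2.1. In Sec. 2.2.2, we describe 17 biomedical entity types that have been utilized in NEREL-BIO (including the DISO entity, which is the DISEASE type renamed from the general NEREL dataset).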

Overview of NEREL nested named entities
To the best of our knowledge, NEREL is the first dataset annotated simultaneously with nested entities, relations between those entities, and knowledge base links (Loukachevitch et al., 2021, 2022). The NEREL entity types include the following groups:
•basic entity types: PERSON, ORGANIZATION, LOCATION, FACILITY and geopolitical entities subdivided into COUNTRY, CITY, DISTRICT and STATE_OR_PROVINCE entities;
•numerical entities: NUMBER, ORDINAL, DATE, TIME, PERCENT, MONEY, AGE;
•socio-political entities (NATIONALITY, RELIGION, IDEOLOGY) and LANGUAGE;
•law-related entities (LAW, CRIME, PENALTY);
•work-related entities (PROFESSION, WORK_OF_ART, PRODUCT, AWARD).
The DISEASE entity type is most relevant to the biomedical domain. It can also be noted that PERSON, ORGANIZATION, and LOCATION entities from the basic entity group in the general domain were annotated in the biomedical MedMentions corpus (Mohan and Li, 2018); locations were also annotated in the QUAERO corpus (Névéol et al., 2014) (OCCU) type. Some NEREL entities (such as WORK_OF_ART, AWARD, PENALTY) are less relevant to the biomedical domain.

Text Collection
We sourced documents from the WMT-2020 Biomedical Translation Task collection (Bawden et al., 2020), which contains 6,029 Medline abstracts in Russian and their English translations. We trained an entity recognition model on the English MedMentions corpus (Mohan and Li, 2018) and applied it to the Russian abstracts in a zero-shot fashion. We picked about 100 documents with the densest and most diverse recognized entities.
Based on the analysis of this automatic annotation, we selected abstracts with disease mentions and related laboratory or medical procedures for inclusion in NEREL-BIO. The abstracts were annotated using the BRAT annotation tool (Stenetorp et al., 2012). To facilitate manual annotation, initial annotation was done automatically with two models: multilingual BERT (Devlin et al., 2019) trained on the English MedMentions (Mohan and Li, 2018) for biomedical entity recognition (10 entity types), and an MRC model (Li et al., 2020) trained on the NEREL dataset, which helped label nested entities from the general domain (29 entity types). The automatic techniques provided annotation of the most evident entities and became a basis for further manual labeling.
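BRAT stores annotations in a standoff format, where each entity line has the form `ID<TAB>TYPE START END<TAB>TEXT`. The following is a minimal sketch of reading such entity lines, assuming the corpus files follow standard BRAT standoff conventions; the function name `parse_brat_entities` and the sample content are hypothetical and only serve as an illustration.

```python
def parse_brat_entities(ann_text):
    """Parse entity ("T") lines from BRAT standoff text.

    Returns a list of dicts with id, label, character fragments, and text.
    Non-entity lines (relations, notes, etc.) are skipped.
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line, ignored in this sketch
        ent_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous entities list several "start end" fragments separated by ";"
        fragments = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
        entities.append({"id": ent_id, "label": label, "fragments": fragments, "text": text})
    return entities


# Made-up sample: two nested entity lines and one relation line.
SAMPLE = (
    "T1\tDISO 0 42\tleft-sided congenital diaphragmatic hernia\n"
    "T2\tDISO 36 42\thernia\n"
    "R1\tPART_OF Arg1:T2 Arg2:T1\n"
)
entities = parse_brat_entities(SAMPLE)
```

Because nested entities are simply overlapping spans in this format, no special syntax is needed for nesting; it can be recovered by comparing offsets.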
Table 1 summarizes statistics of NEREL-BIO in terms of documents and entity mentions. Table 2 contains the most frequent disease mentions in the Russian part of NEREL-BIO. It can be seen that the abstracts are quite diverse in content.

Entity Types
Biomedical entity types were selected for annotation based on their presence in the UMLS taxonomy and in other annotated datasets in the biomedical domain. NEREL-BIO includes 16 specialized biomedical entity types and 29 entity types from the general NEREL dataset. The full set of entity types, explanations, and examples in NEREL-BIO is presented in Table 4.
Biomedical entity types in Table 4 correspond to the most relevant UMLS concepts and are annotated according to UMLS definitions. There are a few exceptions, as given below:
•HEALTH_CARE_ACTIVITY, which is described as a quite general concept in UMLS, is treated as health care administration and organization activities such as hospitalization or medical evacuation;
•the LABPROC entity comprises both laboratory and other diagnostic procedures;
•the FINDING entity does not have a direct correspondence in UMLS; it conveys the results of the scientific study described in the abstract, e.g. longer hospital stay, stopped the progression.
The entity types in Table 3 are attached to the most relevant UMLS semantic types and ordered according to the UMLS taxonomy. Also, for each entity type, the corresponding UMLS concept is found and its identifier (CUI) is included in the table. Table 3 also contains statistics for the Russian and English parts of the NEREL-BIO corpus.
It can be seen that all entity types were successfully linked to the UMLS taxonomy except for WORK_OF_ART, which is missing in UMLS. Rare or absent entity types in the NEREL-BIO dataset are as follows: IDEOLOGY (0), RELIGION (0), AWARD (2), LANGUAGE (5), PENALTY (0), CRIME (1), and LAW (10). At the same time, we see quite diverse mentions of geographical locations and some mentions of money (mainly in the context of medical expenses). Mentions of professions or occupations are quite frequent: mainly medical specialists are mentioned, but there are also studies on occupational diseases of specific professional groups.
Some principles of annotation employed in the general domain were changed in NEREL-BIO. In particular, in the general domain, mainly capitalized mentions were annotated as named entities. In the biomedical domain, the same entity types can also appear as lower-cased mentions:
•any humans or groups can be annotated with the label PERSON, such as patient, control group, population with low income;
•the ORGANIZATION tag is used not only for tagging specific organizations but also for organization types such as hospital, medical institution, rehabilitation center;
•location-related tags (LOCATION, COUNTRY, CITY, STATE_OR_PROVINCE, DISTRICT, FACILITY) are also used in both cases: rural settlement, low-income countries, coastal areas.
Entities in NEREL-BIO often appear lower-cased while being absent in UMLS. For example, the term left-sided congenital diaphragmatic hernia is absent in UMLS. We annotate it as a nested entity: although we cannot link the whole term to UMLS, we can link the sub-terms: Hernia (C0019270), Diaphragmatic Hernia (C0019284), Respiratory Diaphragm (C0011980), Congenital diaphragmatic hernia (C0235833).
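The nesting in this example can be made concrete with a small sketch. The CUIs are those quoted above; the `Span` class, the helper names, the entity labels assigned to each sub-span, and the choice of "diaphragmatic" as the span linked to Respiratory Diaphragm are assumptions made for illustration, not the corpus annotation itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    start: int            # character offset, inclusive
    end: int              # character offset, exclusive
    label: str            # entity type (assumed for this illustration)
    cui: Optional[str] = None  # UMLS concept identifier, if linkable


TERM = "left-sided congenital diaphragmatic hernia"


def span_of(sub, label, cui=None):
    """Build a Span for the first occurrence of `sub` inside TERM."""
    start = TERM.index(sub)
    return Span(start, start + len(sub), label, cui)


spans = [
    span_of("left-sided congenital diaphragmatic hernia", "DISO"),  # not in UMLS
    span_of("congenital diaphragmatic hernia", "DISO", "C0235833"),
    span_of("diaphragmatic hernia", "DISO", "C0019284"),
    span_of("hernia", "DISO", "C0019270"),
    span_of("diaphragmatic", "ANATOMY", "C0011980"),  # assumed span for Respiratory Diaphragm
]


def depth(span, spans):
    """Number of other spans that contain `span` (0 for the outermost span)."""
    return sum(
        1
        for o in spans
        if (o.start, o.end) != (span.start, span.end)
        and o.start <= span.start
        and span.end <= o.end
    )


linked = [s for s in spans if s.cui is not None]
max_depth = max(depth(s, spans) for s in spans)
```

Here the outermost term carries no CUI, while every inner span does, which is the situation the nested annotation is meant to capture.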
For annotating multiword terms we followed these guidelines:
•two- or three-word terms in the form of noun groups without prepositions that are discussed in texts are annotated without additional checks;
•longer multiword phrases containing prepositions should be supported with some additional evidence; for example, there can be an abbreviation in the text for a long multiword term (ST-segment elevation acute coronary syndrome, STSEACS), or a long term or its English equivalent can be found in UMLS (Metastasis from malignant tumor of liver, C1282502) or other biomedical resources;
•internal spans in an annotated multiword term (single words or phrases) that can be considered valid biomedical terms are also annotated with the corresponding entity types;
•general adjectives and adjectival quantifiers are not included in the annotated entity: various tumors is annotated as various [tumors] DISO.
The annotation scheme was created during multi-round preliminary annotation of parallel Russian and English abstracts. Terminologists experienced in terminological studies, including in the biomedical domain, were involved in the annotation. All annotated abstracts were additionally checked by a moderator.
In Table 5 we provide a brief summary of how frequently nested entities appear in NEREL-BIO. For each entity type, we counted how many times entities of this type appear as an outer entity (eliminating multiple occurrences of the same entity), and divided this number by the total occurrences of the entity type in the corpus. Then we filtered out the types with few occurrences in the corpus. The top ten entity types along with their nestedness frequency are presented in the table. Frequencies in the parallel English/Russian abstracts of NEREL-BIO are shown in the last two columns of Table 5. Here we compare only 100 parallel abstracts for each language.
Frequencies of nested entities in the Russian and the smaller English corpus were mostly comparable. The differences can be explained by the following:
1. the abstracts are not fully parallel: paper titles are absent in Russian abstracts but included in English abstracts;
2. the different syntax of the languages determines different structures of sentences and nestedness;
3. sentences in Russian and in English are not always direct translations but can be significantly reformulated.
The last two factors especially affected the FINDING entity, since findings can be long and therefore can be formulated in multiple ways.
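The nestedness frequency described above can be sketched as follows. This is one possible reading, not the authors' script: it assumes that mentions are given as `(doc_id, start, end, label)` tuples, that an entity is "outer" when it contains at least one other span in the same document, and that each outer mention is counted once regardless of how many inner entities it contains. The function names are hypothetical.

```python
from collections import Counter


def contains(outer, inner):
    """True if `outer` strictly contains `inner` within the same document."""
    same_doc = outer[0] == inner[0]
    covers = outer[1] <= inner[1] and inner[2] <= outer[2]
    different_span = (outer[1], outer[2]) != (inner[1], inner[2])
    return same_doc and covers and different_span


def nestedness_by_type(mentions):
    """Share of mentions of each type that act as an outer entity."""
    total = Counter(m[3] for m in mentions)
    outer = Counter()
    for m in mentions:
        # Each outer mention is counted once, however many entities it contains.
        if any(contains(m, other) for other in mentions):
            outer[m[3]] += 1
    return {label: outer[label] / total[label] for label in total}


# Made-up mentions: one DISO containing two inner spans, plus a flat DISO.
mentions = [
    ("d1", 0, 42, "DISO"),
    ("d1", 36, 42, "DISO"),
    ("d1", 22, 35, "ANATOMY"),
    ("d2", 0, 6, "DISO"),
]
ratios = nestedness_by_type(mentions)
```

The quadratic comparison is acceptable for a sketch; a per-document index would be preferable on the full corpus.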
Additionally, we analyzed nested entities in the following way. We aggregated typical pairs of nested entities from the corpus. Each pair has an outer and an internal entity. Table 6 presents the top ten pairs of types for such entities. Note that an outer entity can contain one, two or more internal entities. In fact, the NEREL-BIO dataset has outer entities that contain up to eight internal entities at the same level of nestedness. Therefore, we provide raw counts in the table. Overall, the Russian part of NEREL-BIO contains 22,392 such pairs (the English subset has 3,864 nested entity pairs).
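Counting pairs of outer and internal entity types can be sketched in a few lines. The sketch assumes mentions given as `(doc_id, start, end, label)` tuples and counts every containment pair, not only directly adjacent nesting levels; whether the reported counts use the same convention is an assumption. The function name `nested_type_pairs` is hypothetical.

```python
from collections import Counter


def nested_type_pairs(mentions):
    """Raw counts of (outer label, inner label) pairs over all containments."""
    pairs = Counter()
    for outer in mentions:
        for inner in mentions:
            same_doc = outer[0] == inner[0]
            covers = outer[1] <= inner[1] and inner[2] <= outer[2]
            different_span = (outer[1], outer[2]) != (inner[1], inner[2])
            if same_doc and covers and different_span:
                pairs[(outer[3], inner[3])] += 1
    return pairs


# Made-up mentions for illustration.
mentions = [
    ("d1", 0, 42, "DISO"),
    ("d1", 22, 42, "DISO"),
    ("d1", 22, 35, "ANATOMY"),
]
pair_counts = nested_type_pairs(mentions)
top_pairs = pair_counts.most_common(10)
```

Raw counts are used here because a single outer entity may contribute several pairs, so a normalized ratio per outer entity would be harder to interpret.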

Models
The MRC task is formulated in the following way: for the given context X and question Q, the model should obtain answer A with some function F defined as A = F(X, Q). In the named entity recognition task, X is the given sentence/paragraph; Q is a generated or selected query sentence for a given named entity type; A is the subsequence of the context X that denotes the named entity; F is the retrieving model itself.
For the sequence model, we employed three binary classifiers based on the output of the last hidden layer from the RuBERT model (Russian BERT) (Kuratov and Arkhipov, 2019). The first classifier determines the starting position of a named entity. The second classifier determines the ending position of a named entity (possibly a different one) of the same class. The third classifier decides whether a chosen start-end pair represents a single named entity of that class. These classifiers are trained for each class (type) separately. The batch size was set to 16 with a maximum sequence length of 192 tokens. The model was trained for 16 epochs on 8 Tesla V100 GPUs. Other parameters were set to default values following Li et al. (2020).
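The interplay of the three classifiers can be illustrated with a minimal decoding sketch for one entity type. It assumes that the classifiers output probabilities and that a fixed threshold of 0.5 is applied; the function name `decode_spans`, the threshold, and the toy scores are illustrative assumptions, and no neural model is involved.

```python
def decode_spans(start_probs, end_probs, match_prob, threshold=0.5):
    """Combine start, end, and start-end match scores into entity spans.

    start_probs[i]: probability that token i starts an entity of this type.
    end_probs[j]:   probability that token j ends an entity of this type.
    match_prob(i, j): probability that (i, j) together form one entity.
    Returns inclusive (start, end) token index pairs; nested spans are allowed.
    """
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [j for j, p in enumerate(end_probs) if p > threshold]
    spans = []
    for i in starts:
        for j in ends:
            if j >= i and match_prob(i, j) > threshold:
                spans.append((i, j))
    return spans


# Toy scores for the tokens: isolated / bronchus / resection.
tokens = ["isolated", "bronchus", "resection"]
start_probs = [0.9, 0.8, 0.7]
end_probs = [0.1, 0.6, 0.9]
accepted = {(0, 2), (1, 1), (1, 2), (2, 2)}


def toy_match(i, j):
    """Stand-in for the third classifier."""
    return 0.9 if (i, j) in accepted else 0.2


spans = decode_spans(start_probs, end_probs, toy_match)
mentions = [" ".join(tokens[i:j + 1]) for i, j in spans]
```

Because start and end positions are predicted independently and then paired, overlapping and nested spans of the same type fall out naturally, which a single tag-per-token scheme cannot express.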
For training the MRC model, for each entity type (e.g. ORGANIZATION), we employed manually collected definitions of the corresponding concepts from dictionaries (including Wikipedia), frequent mentions of the entity type in the training collection, some contexts (sentences) from the training collection, and keywords (e.g. ORG).
We compared several question variants:
•Keyword: the question consists of entity tags such as DISO or ANATOMY (Li et al., 2020);
•Component-based: the 2, 5, or 10 most frequent lemmatized components of a given entity type are used for formulating a query, for example "DISO are entities such as a tumor, complication, disorder, disease, illness" (5-component example). Previous experiments with the general NEREL dataset showed that component-based questions outperformed other variants (Rozhkov and Loukachevitch, 2022);
•Contextual: a sentence from the training sample that contains a named entity of the given type, without any explicit or implicit labeling of this entity in the sentence. For example, a question for the DISO entity type can be as follows: "60 patients in the most acute period of hemispheric ischemic stroke were examined.";
•Lexical: as in the contextual variant, a sentence from the training corpus is used as a question; additionally, the entity of the given type is masked with its label (Zhou and Chen, 2021). An example of masking a sentence with several mentions of an entity type looks as follows. The initial sentence contains three mentions of DISO: "The addition of gout contributes to endothelial dysfunction and worsens the course of hypertension." The corresponding lexical question is: "The addition of DISO contributes to DISO and worsens the course of DISO."
We used the so-called full lexical approach, in which all entities of the given type in a sentence are substituted with masks. If a longer entity contains a shorter entity of the same type, the longer entity is preferred (outermost variant). The lexical variant corresponding to the above-mentioned contextual example is as follows: "60 patients in the most acute period of hemispheric DISO were examined". The selection of a sentence for contextual or lexical questions is carried out in the following manner:
•the most frequent entity for a given entity type is selected;
•the first sentence in the training set that contains the selected entity is extracted to be used as a question. By "first" we mean the lexicographic order of the filenames of the original dataset.
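The full lexical masking with the outermost preference can be sketched as below. The sketch assumes character-offset spans for all entities of one type in the sentence and no partially crossing spans; the names `lexical_question` and `find` are hypothetical helpers, not part of any released code.

```python
def lexical_question(sentence, spans, label):
    """Replace every outermost entity span of one type with its label.

    spans: (start, end) character offsets, end exclusive, all of type `label`.
    Spans nested inside a longer span of the same type are dropped.
    """
    unique = set(spans)
    outermost = [
        s for s in unique
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in unique)
    ]
    pieces, pos = [], 0
    for start, end in sorted(outermost):
        pieces.append(sentence[pos:start])
        pieces.append(label)
        pos = end
    pieces.append(sentence[pos:])
    return "".join(pieces)


def find(sentence, sub):
    """Character span of the first occurrence of `sub` (illustration helper)."""
    start = sentence.index(sub)
    return (start, start + len(sub))


s1 = ("The addition of gout contributes to endothelial dysfunction "
      "and worsens the course of hypertension.")
q1 = lexical_question(
    s1,
    [find(s1, "gout"), find(s1, "endothelial dysfunction"), find(s1, "hypertension")],
    "DISO",
)

s2 = "60 patients in the most acute period of hemispheric ischemic stroke were examined."
# "stroke" is nested inside "ischemic stroke", so only the longer span is masked.
q2 = lexical_question(s2, [find(s2, "ischemic stroke"), find(s2, "stroke")], "DISO")
```

Masking only the outermost span keeps the question grammatical; masking both nested spans would otherwise produce overlapping replacements.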
We also provide experimental results for the second-best Sequence model (Shibuya and Hovy, 2020), since it gave comparable results on the NEREL dataset. For this setup, we employed the RuBERT model with the batch size set to 16 and the same maximum length of 192 tokens. The model was trained for 32 epochs on 8 GPUs, while other parameters were set to default values.

Results
Span-level micro- and macro-averaged precision, recall, and F1 results of the models are shown in Table 7. The performance of the 5-component MRC model for the ten most frequent entities is presented in Table 8.
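For reference, span-level micro- and macro-averaged scores can be computed as in the following sketch. It assumes exact-match evaluation over `(doc_id, start, end, label)` tuples and defines macro-F1 as the mean of per-type F1 scores; the exact evaluation script used for Table 7 may differ, and the function names are hypothetical.

```python
def prf(tp, n_pred, n_gold):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def evaluate(gold, pred):
    """Exact-match span evaluation; gold and pred are sets of span tuples."""
    labels = sorted({g[-1] for g in gold} | {p[-1] for p in pred})
    per_type = {}
    for label in labels:
        g = {x for x in gold if x[-1] == label}
        p = {x for x in pred if x[-1] == label}
        per_type[label] = prf(len(g & p), len(p), len(g))
    # Micro: pool all spans; macro: average the per-type scores.
    micro = prf(len(gold & pred), len(pred), len(gold))
    n = len(per_type) or 1
    macro = tuple(sum(v[k] for v in per_type.values()) / n for k in range(3))
    return micro, macro, per_type


# Made-up gold and predicted spans.
gold = {("d", 0, 5, "DISO"), ("d", 6, 9, "DISO"), ("d", 10, 12, "ANATOMY")}
pred = {("d", 0, 5, "DISO"), ("d", 10, 12, "ANATOMY"), ("d", 13, 15, "ANATOMY")}
micro, macro, per_type = evaluate(gold, pred)
```

Micro-averaging is dominated by frequent entity types, while macro-averaging weights every type equally, which is why the two can diverge when some types have little training data.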
As shown in Table 7, the best macro-averaged results are achieved by the 5-component model. The performance of the 5-component model varies greatly depending on the entity type (see Table 8). In particular, this model achieves 85% and 61% F1 on ANATOMY and PHYS, respectively. We note that the best obtained results of nested NER for NEREL-BIO are lower than for NEREL (where the MRC model achieved 80% micro-F-measure). This is in line with published NER results that also show decreased performance on biomedical texts (Shibuya and Hovy, 2020; Liu et al., 2022). The results for the second-best sequence model are closest to the MRC model in micro measures but significantly worse in macro measures. This can partly be explained by the low amount of training data for specific entity types (Artemova et al., 2022).

Error Analysis
We analysed the results of the best MRC model on the NEREL-BIO test collection in comparison with manual annotation and found the following frequent types of errors:
•misclassification of abbreviations, which can be of different entity types but look very similar: IPN (iskrivliniye peregorodki nosa, deviated septum of the nose), MPT (methadone maintenance therapy);
•evidently longer entities than necessary were extracted: including verbs ("subgroup was taken in the 2nd group"), including conjunctions ("ART and MMT"), with a comma in the middle ("EMBASE, Medline"), etc.;
•some irrelevant entities can be labeled, for example, "level of education" was classified as PHYS.
Missing entities were also found in the human annotation, due to the difficulty of the annotation task itself. For example, in the phrase "mild cognitive impairment", an annotator missed labeling "cognitive impairment".

Discussion and Limitations
Several issues may potentially limit the applicability of NEREL-BIO; they are mostly shared with other available datasets.
Seen and unseen mentions of entities. Recent works on BERT-based models for information extraction demonstrate that the generalization ability of these models is influenced by domain shift and by whether the test entity/relation has been seen in the training set (Miftahutdinov et al., 2020; Tutubalina et al., 2020; Kim and Kang, 2022). To avoid such biases, Kim and Kang (2022) remove overlaps in entity mentions and concept identifiers between training and test sets, while Tutubalina et al. (2020) focus on zero-shot entity linking between different concept terminologies.
We leave these approaches to future work.We plan to investigate how well MRC models for nested NER can be adapted to unseen mentions.
Knowledge transfer between general and biomedical domains. The proposed NEREL-BIO corpus shares its annotation scheme with our general-domain dataset NEREL for common entity types such as AGE, NUMBER, FACILITY, and ORGANIZATION (29 types in total). Transferability of trained models across two datasets with completely different contexts can be limited due to domain shift, while sequential training can cause complete retraining of model weights. We leave the investigation of strategies for combining different domains to future work.
Disease-centric abstracts. NEREL-BIO includes PubMed abstracts describing the results of clinical trials, hospitalization, and treatment of patients. The most frequent entities (e.g., diseases, injury, and anatomy) are related to the clinical domain, while biological entities such as genes and proteins are less represented. We suppose that this restricts the extraction of new biological relationships for protein-protein interaction or knowledge graph completion tasks, which will require additional data annotation.

Conclusion
Biomedical texts contain numerous nested mentions of entities, such as anatomical parts within each other, diseases containing body parts or chemicals, names of procedures that include diseases or devices, etc. In this paper, we presented NEREL-BIO, the first Russian dataset of biomedical abstracts annotated with nested entities. The selected abstracts focus primarily on diseases and related medical procedures. The dataset also contains a small collection of annotated parallel English abstracts. Our annotation shows that nested entities provide a better basis for extracting relations that would otherwise be lost. Similarly, nested entities also permit more complete entity linking to knowledge bases. Since NEREL-BIO extends the annotation scheme of the general-domain Russian NEREL dataset, it permits studying domain transfer methods.

Table 2. Ten most frequent diseases in NEREL-BIO (translated from Russian).

Table 5. Frequencies of the top ten entity types with nested entities in the full Russian collection, and in 100 Russian and English documents for comparison.

Table 6. Top ten nested entity pairs in NEREL-BIO.

Table 7. Results of nested NER models on NEREL-BIO.