BELB: a biomedical entity linking benchmark

Abstract Motivation Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups, making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad-coverage KB UMLS, leaving their performance on more specialized KBs, e.g. for genes or variants, understudied. Results We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces the preprocessing overhead of testing BEL systems on multiple corpora, offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need for further studies towards entity-agnostic models. Availability and implementation The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.


Introduction
The task of assigning entity mentions found in biomedical text to a knowledge base (KB) entity is known as Biomedical Entity Linking (BEL). Texts in the biomedical domain are rich in ambiguous expressions, with abbreviations being a prominent example, e.g. "WSS" can be either "Wrinkly skin syndrome" or "Weaver-Smith syndrome". BEL resolves such ambiguities and is therefore a crucial component in many downstream applications. For instance, it is used to index PubMed (Mork et al., 2013), a primary archive of biomedical literature.
Although several benchmarks have been developed for biomedical text mining, e.g. BLUE (Peng et al., 2019) and BLURB (Gu et al., 2021), BEL is notably absent from all of them. GeneTuring (Hou and Ji, 2023) contains a module to test normalization, but it covers only genes and is specific to models built on the GPT-3 architecture. The lack of a standardized evaluation setup translates into a wide variety of approaches: different studies use different combinations of corpus and KB and different evaluation protocols. These differences severely limit direct comparison of results (see Appendix A).
In the biomedical domain different entity types require normalization to different specialized KBs (Wei et al., 2019), e.g. species to NCBI Taxonomy (Scott, 2012) but genes to NCBI Gene (Brown et al., 2015). Yet important types such as genes and variants are completely absent from corpora commonly used to evaluate neural BEL approaches (see 2.1.1), which instead only target UMLS. Although adapting neural approaches to other KBs is possible, it leaves open the question of whether their performance transfers across entity types. Additionally, as corpora are distributed in different formats, developing new BEL approaches (or adapting existing ones to new corpora) requires writing new input-output and quality assurance routines, e.g. to correct wrong mention boundaries, increasing the overall engineering turnaround.

Fig. 1: We illustrate the main advantages of BELB. In (a) we see the current state of experimental setups for biomedical entity linking. Different studies use different (i) preprocessing (I/O routines), (ii) combinations of corpora and KBs and (iii) evaluation protocols, ultimately making published numbers not directly comparable. With BELB (b) researchers have access to (i) uniformly preprocessed corpora and KBs, which can be accessed programmatically, and (ii) a standardized evaluation protocol, greatly reducing preprocessing overhead and maximizing comparability and reproducibility.
To facilitate research on BEL, we introduce BELB, a Biomedical Entity Linking Benchmark. BELB provides access to 11 corpora linked to 7 KBs. All components undergo extensive data cleaning and are converted into a unified format, covering six biomedical entity types (gene, disease, chemical, species, cell line and variant). As shown in Figure 1, BELB significantly lowers the barrier for research in the field, allowing researchers to (i) train models on corpora of the highest possible quality and (ii) fairly compare them against other approaches with minimal preprocessing overhead (see Appendix B for a simple showcase). Using BELB, we perform an extensive comparison of six rule-based domain-specific systems and three neural methods. Our findings show that results of neural approaches do not transfer across entity types, with specialized rule-based systems still being the best option for the gene and disease entity types. We hope that our publicly available benchmark will be adopted by future work, allowing approaches to be compared fairly and accelerating progress towards more robust neural models.

Materials and methods
In this study we introduce BELB, a benchmark for standardized evaluation of models for BEL. The task is formulated as predicting an entity e ∈ E from a KB given a document d and an entity mention m, a pair of start and end positions ⟨ms, me⟩ indicating a span in d. We use BELB to compare rule-based domain-specific systems and state-of-the-art neural approaches. In all experiments we use in-KB gold mentions: each mention has a valid gold KB entity (Röder et al., 2018) and its position in d is given.
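This formulation can be made concrete with a small sketch (hypothetical data structures, not BELB's actual API): a mention is a character span ⟨ms, me⟩ in a document, and in the in-KB gold-mention setting a system receives the document and the span and must return a KB identifier.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    start: int   # ms: start offset in the document text
    end: int     # me: end offset (exclusive)
    gold: str    # gold KB identifier (in-KB setting)

def mention_text(document: str, m: Mention) -> str:
    """Extract the surface form of a mention from its document."""
    return document[m.start:m.end]

# A BEL system is then any function (document, mention) -> entity identifier.
doc = "Features of ARCL type II overlap with those of Wrinkly skin syndrome (WSS)"
m = Mention(start=47, end=68, gold="MeSH:C536750")
```

The offsets here are illustrative; in BELB they are provided by the corpus annotations.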

Biomedical Entity Linking Benchmark
We report an overview of the 11 corpora and 7 KBs available in BELB in Tables 1 and 2, respectively. Their detailed description can be found in Appendix C. In the following we outline crucial properties of BEL and highlight how they are accounted for in BELB, enabling comprehensive analysis and fair evaluation of BEL models.

Specialized knowledge bases
In biomedical information extraction, instances of multiple entity types are linked to specialized KBs (Wei et al., 2019). However, recent studies in the NLP community focus primarily on the MedMentions corpus, which links to UMLS (Liu et al., 2021; Zhang et al., 2022; Agarwal et al., 2022, inter alia). Additionally, in MedMentions entity types such as diseases and genes are covered only marginally or not at all, respectively (see Appendix D).
This calls into question how well results obtained in this setting transfer, for instance, to publications in genomics or molecular biology in general. In BELB we cover six entity types (gene, species, disease, chemical, cell line and variant) represented by 11 corpora linked to 7 specialized KBs (for comparison with previous studies we include UMLS as well). We design a unified schema to harmonize all KBs (see Appendix E). This allows testing a model's ability to preserve performance across multiple pairs of corpus and KB with minimal preprocessing overhead.

Table 2. Overview of the KBs available in BELB according to their entity type. We report the number of entities, synonyms per entity, homonyms and how many of them are the primary name (PN). † No archive of previous versions is provided

Unseen entities and synonyms
Corpora typically cover only a small fraction of all entities in a KB. Additionally, biomedical entities have multiple names (synonyms), e.g. both "Oculootofacial dysplasia" and "Burn-Mckeown Syndrome" are valid names of "MeSH:C563682". Hence, even if an entity occurs in the training set, not all of its surface forms are necessarily included. In BELB we assign a unique identifier to each mention and provide lists of mentions of (i) unseen entities, i.e. entities present in the test set but not in the train set (zero-shot), and (ii) train entities occurring in the test set but with different (case-insensitive) surface forms (Tutubalina et al., 2020). This makes it easy to report a model's performance in (i) generalizing to new entities and (ii) recognizing known ones appearing in different forms.
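The two mention lists can be derived mechanically from the train/test annotations. The following sketch (hypothetical helper, not BELB's actual API) illustrates the case-insensitive criterion:

```python
def split_test_mentions(train, test):
    """Partition test mentions into zero-shot (entity unseen in training)
    and stratified (entity seen, but with a new case-insensitive surface form).
    `train` and `test` are lists of (surface_form, entity_id) pairs."""
    train_entities = {e for _, e in train}
    train_surfaces = {(s.lower(), e) for s, e in train}
    zero_shot, stratified = [], []
    for surface, entity in test:
        if entity not in train_entities:
            zero_shot.append((surface, entity))
        elif (surface.lower(), entity) not in train_surfaces:
            stratified.append((surface, entity))
    return zero_shot, stratified

train = [("Oculootofacial dysplasia", "MeSH:C563682")]
test = [
    ("Burn-Mckeown Syndrome", "MeSH:C563682"),     # seen entity, unseen synonym
    ("OCULOOTOFACIAL DYSPLASIA", "MeSH:C563682"),  # seen (case-insensitive)
    ("Weaver-Smith syndrome", "MeSH:C536687"),     # unseen entity
]
zs, st = split_test_mentions(train, test)
```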

Homonyms
Discriminating mentions with the same surface form but representing different entities (homonyms) by their context is indispensable to BEL. This is because in biomedical KBs the same synonym can be associated with multiple entities. This is especially the case for abbreviations.

Table 3. Examples of homonym mentions (in bold) requiring specific contextual information (underlined) for linking.
a) Features of ARCL type II overlap with those of Wrinkly skin syndrome (WSS) [MeSH:C536750]
b) Weaver-Smith syndrome (WSS) is a Mendelian disorder of the epigenetic machinery [MeSH:C536687]
c) α2microglobulin exacerbates brain damage after stroke in rats. [NCBI Gene:24153]
d) The T67 cell line produced the proteinase inhibitor α2microglobulin. [NCBI Gene:2]
e) We identified the novel mutation c.908G>A within exon 8 of the CTSK gene. [rs756250449]
f) The patient was compound heterozygous of the c.908G>A mutation in the SLC17A5 gene. [rs1057516601]
g) The GSK650394 inhibitor is used to suppress SGK1 expression in PC12 cells. [CVCL 0481]

For instance, as in example (a) in Table 3, "WSS" is the abbreviated form of two syndromes and it appears twice in CTD Diseases. Another example are genes present in multiple species, as in (c), where the string "rats" is essential for correct normalization, as "α2microglobulin" could refer either to the human or the rat gene. Identifying contextual information can be non-trivial, e.g. (c) is the title of a publication, but the text may describe general characteristics of "α2microglobulin", introducing textual cues pointing to the human gene. Additionally, this information is not always explicitly expressed and may emerge via other patterns. In example (e) "PC12" denotes a human cell line, whereas in (f) it refers to the rat one. This can be inferred from the capitalized gene mention "SGK1" in (e), which conventionally denotes human genes. By introducing entity types such as genes and variants, BELB makes it possible to probe a model's ability to (i) identify contextual information and (ii) handle highly ambiguous search spaces (KBs).
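A minimal sketch of why pure string lookup fails on homonyms: indexing a toy KB fragment by synonym immediately yields ambiguous candidates for "WSS" (hypothetical helper, illustrative KB entries mirroring examples (a) and (b)):

```python
from collections import defaultdict

def build_synonym_index(kb):
    """Map each lower-cased synonym to all entities carrying it.
    `kb` is a list of (synonym, entity_id) pairs."""
    index = defaultdict(set)
    for synonym, entity in kb:
        index[synonym.lower()].add(entity)
    return index

# Toy KB fragment: the abbreviation "WSS" is shared by two disease entities.
kb = [
    ("Wrinkly skin syndrome", "MeSH:C536750"),
    ("WSS", "MeSH:C536750"),
    ("Weaver-Smith syndrome", "MeSH:C536687"),
    ("WSS", "MeSH:C536687"),
]
index = build_synonym_index(kb)
# A pure string lookup cannot decide between the two entities:
candidates = index["wss"]
```

Only the surrounding context of the mention can disambiguate between the two candidates, which is exactly what context-free, synonym-based models cannot exploit.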

Scale
As mentioned in Section 2.1.1, studies on neural methods have primarily targeted UMLS, which, as shown in Table 2, is one and three orders of magnitude smaller than NCBI Gene and dbSNP, respectively. With its unified format, BELB makes it easy to test how implementations scale to these large KBs.

Synchronization of KB versions
Entity linking is dynamic by nature: over the years entities in KBs are replaced or become obsolete. For instance, in GNormPlus mentions of "MDS1" are linked to the NCBI Gene entity "4197", which was subsequently replaced by "2122". As several KBs do not have a versioning system (see Table 2), it is often not possible to retrieve the exact KB used to create a corpus. Failing to account for these changes may introduce a notable amount of noise when measuring the performance of high-quality systems. BELB offers access to the KB history if available, i.e. a table tracking all changes to the entities. In our preprocessing we update all corpora with the KB version at hand and remove mentions linked to obsolete entities. This also allows updating the predictions of systems shipping with a pre-processed (i.e. non-trivially swappable) KB on corpora created after their release, allowing for fair comparison over time.
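The update step can be sketched as follows (hypothetical helper; the removed identifier "999" is invented for illustration): replaced identifiers are followed, possibly through chains of replacements, and mentions of removed entities are dropped.

```python
def apply_kb_history(gold_labels, history):
    """Update corpus annotations with a KB history table.
    `history` maps an old identifier to its replacement, or to None if the
    entity was removed. Mentions of removed entities are dropped."""
    updated = []
    for label in gold_labels:
        while label in history:          # follow chains of replacements
            label = history[label]
            if label is None:
                break
        if label is not None:
            updated.append(label)
    return updated

# Example from the text: NCBI Gene "4197" (MDS1) was replaced by "2122".
history = {"4197": "2122", "999": None}  # "999" is a hypothetical removed entity
labels = apply_kb_history(["4197", "2122", "999"], history)
```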

Evaluated approaches
We use BELB to perform an extensive evaluation of rule-based and neural methods. We now present the approaches selected for evaluation. We stress that we do not re-implement any method (we rely on the original code).

Rule-based entity-specific systems
We compare the performance of neural models linking to KBs for which specialized systems have been developed, as these are still the de facto standard in BEL (Wei et al., 2019; Mujeen et al., 2022). Specifically, we test the following rule-based methods: GNormPlus (Wei et al., 2015) for genes (NCBI Gene), SR4GN (Wei et al., 2012) for species (NCBI Taxonomy) and tmVar v3 (Wei et al., 2022) for variants (dbSNP). For UMLS we employ SciSpacy (Neumann et al., 2019). We label them rule-based entity-specific (RBES), as for linking they rely on a mixture of string matching approaches and ad-hoc rules tailored to a specific entity type. For diseases and chemicals, we include in the RBES category two systems which are only partly rule-based (stretching our definition), as they better represent the state of the art of disease/chemical-specific models. We use TaggerOne (Leaman and Lu, 2016), a semi-Markov model, for diseases, and opt for the system that won the BioCreative VII NLM-Chem track (Almeida et al., 2022) for chemicals (BC7T2W), which uses both string matching and neural embeddings. To the best of our knowledge there exists no linking approach specific to cell lines. We therefore use a fuzzy string matching approach based on Levenshtein distance (FuzzySearch). For detailed descriptions and information on specific implementations we refer the reader to Appendix G.1. None of these systems requires re-training, as either (i) their normalization component is completely rule-based (SR4GN, tmVar, SciSpacy) or (ii) models trained on the BELB corpora are provided along with the code (GNormPlus, TaggerOne, BC7T2W).

Neural systems
We train the following neural models on the train split of each BELB corpus (see Appendix G.2 for training details).

BioSyn (Sung et al., 2020) is a dual encoder architecture. Importantly, BioSyn does not account for context, i.e. it uses only entity mentions. The model is trained via "synonym marginalization": it learns to maximize the similarity (inner product) between a mention embedding and all the synonym embeddings of the gold entity. At inference it retrieves the synonyms most similar to the given test mention, i.e. it relies on a lookup from synonym to entity. We prefer BioSyn over SapBERT (Liu et al., 2021) as the latter is primarily a pre-training strategy.

GenBioEL (Yuan et al., 2022) is an encoder-decoder model. As input it takes a text with a single mention marked with special tokens. The system is then trained to generate a synonym. At inference it ensures that the prediction is a valid KB synonym by constraining the generation process with a prefix tree created from all KB synonyms. Similar to BioSyn, this approach represents KB entities by their synonyms. The authors also propose "KB-Guided Pre-training", a prompting-based method to pre-train GenBioEL on the KB, which we do not deploy. This is because (i) it would introduce an advantage over other neural methods and (ii) it is too computationally expensive to run for each KB.

arboEL (Agarwal et al., 2022) is a dual encoder as well. The authors propose to construct k-nearest-neighbor graphs over mention and entity clusters. Using a pruning algorithm they then generate directed minimum spanning trees rooted at entity nodes, and use the edges as positive examples for training. At inference they use the entity present in the mention's cluster. Notably, arboEL learns entity embeddings, encoded as a concatenated list of synonyms. The authors also use a cross-encoder (Wu et al., 2020), i.e. a second reranking model which takes the top-64 entities retrieved by a trained arboEL as hard negatives (training) and as linking candidates (inference). In our experiments we do not make use of this extension as it is not strictly part of the arboEL algorithm.
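GenBioEL's constrained generation can be illustrated with a small prefix-tree sketch (illustrative token ids and helper names, not the actual implementation): at each decoding step only tokens that extend a valid KB synonym are allowed.

```python
def build_prefix_tree(synonym_token_ids):
    """Build a prefix tree over tokenized KB synonyms.
    Each synonym is a sequence of token ids; None marks end-of-synonym."""
    root = {}
    for ids in synonym_token_ids:
        node = root
        for tok in ids:
            node = node.setdefault(tok, {})
        node[None] = {}  # end-of-synonym marker
    return root

def allowed_next_tokens(tree, prefix):
    """Tokens that keep the generated sequence a valid KB synonym prefix."""
    node = tree
    for tok in prefix:
        if tok not in node:
            return set()
        node = node[tok]
    return {t for t in node if t is not None}

# Two toy tokenized synonyms: [5, 7, 9] and [5, 8].
tree = build_prefix_tree([[5, 7, 9], [5, 8]])
```

During beam search, the decoder's vocabulary distribution would be masked to the set returned by `allowed_next_tokens`, guaranteeing the output is a KB synonym.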

Evaluation protocol
We now describe in detail the evaluation protocol which we followed in our experiments. For all systems we report the mention-level recall@1 (accuracy), since RBES approaches generate only a single candidate.

Synonym as entity proxy
Approaches using strings as proxies for entities (BioSyn, GenBioEL) cannot meaningfully resolve ambiguous mentions. That is, for a mention of rat "α2microglobulin", they would return a list containing both NCBI Gene "2" (human) and "24153" (rat). Sung et al. (2020) introduced a lenient evaluation, which considers a prediction correct if any of the returned entities matches the gold one. As reported by Zhang et al. (2022), this largely overestimates performance. Following their suggestion, we opt for a standard evaluation, which randomly samples one prediction from the returned list. However, as one of the aims of BEL is direct deployment in extraction pipelines, e.g. for constructing gene networks (Lehmann et al., 2015), we also include a strict evaluation in which all such cases, i.e. multiple predictions, are considered incorrect.
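The three evaluation modes can be summarized in one hedged sketch (hypothetical helper, not our actual evaluation code); note that "standard" samples one candidate at random, so its score on ambiguous predictions varies with the seed:

```python
import random

def accuracy(predictions, golds, mode="standard", seed=13):
    """Mention-level recall@1 under the three evaluation modes.
    `predictions` is a list of candidate lists (synonym-based systems may
    return several entities tied for the same surface form)."""
    rng = random.Random(seed)
    correct = 0
    for candidates, gold in zip(predictions, golds):
        if mode == "lenient":            # any candidate may match
            correct += gold in candidates
        elif mode == "strict":           # ambiguity counts as an error
            correct += candidates == [gold]
        else:                            # standard: sample one candidate
            correct += rng.choice(candidates) == gold
    return correct / len(golds)

preds = [["2", "24153"], ["24153"]]      # first mention is ambiguous
golds = ["24153", "24153"]
```

On this toy input, lenient scores 1.0 and strict scores 0.5; standard lands in between depending on the sampled candidate.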

Disentangling recognition and linking
Some RBES systems (GNormPlus, SR4GN, TaggerOne, tmVar) perform entity recognition and linking jointly (see Appendix G.1). Due to false negatives in the NER step, we cannot obtain their results on the full test set. To ensure that we measure performance on the same instances for all methods, for corpora whose reference RBES system is a joint approach we use only the test mentions which are correctly identified during entity recognition. For instance, for NLM-Gene we use only 73% of the test mentions, i.e. those correctly recognized by GNormPlus (see Table 13 for other corpora). As correct recognition correlates with correct normalization, our evaluation protocol probably introduces a bias towards RBES systems (see Section 4).

Multiple gold entities
Mentions in biomedical corpora can carry multiple normalizations. Common instances are composite mentions, e.g. "breast and squamous cell neoplasms", and ambiguous ones, e.g. "Toll-like receptor" ("Toll-like receptor 2", "4" and "9"). Whether these cases are logical AND or OR is not always specified in the annotation guidelines. We opt for the more lenient OR interpretation and consider a prediction correct if it matches one of the gold entities.

Results

Table 4 reports the results of neural models and entity-specific models grouped under the RBES category. For results with strict evaluation (Section 2.3.1) and on the full test sets (Section 2.3.2) please see Tables 14 and 15, respectively. We observe that the performance of neural models varies significantly across entity types, with disease and gene corpora incurring the most significant drop.

Homonyms

Besides the implicit bias towards RBES approaches (see Section 2.3.2), we hypothesize that an important factor at play are homonyms. RBES systems use ad-hoc components to handle these challenging cases. For instance, GNormPlus directly integrates Ab3P (Sohn et al., 2008), a widely adopted abbreviation resolution tool, and SR4GN, which is specifically developed for cross-species gene normalization. Neural models lack these components, and synonym-based approaches are significantly impacted by random selection in case of homonyms. In Table 5 we show that resolving abbreviations with Ab3P yields a notable improvement in performance for diseases. Similarly, if we use a lenient evaluation (see Section 2.3.1), GenBioEL is almost on par with GNormPlus on genes. In contrast, abbreviation resolution has no impact on arboEL. We argue that this is because arboEL uses entity embeddings, which benefit less from long-form mentions. Secondly, as entity embeddings require learning a compressed entity representation, arboEL is affected by the limited size of the corpora. This is supported by results on MedMentions, which is one order of magnitude larger than the other BELB corpora and where arboEL is confirmed as the state-of-the-art approach.

Unseen entities and synonyms
In Table 6 we see that neural approaches are outperformed by RBES systems on mentions of unseen entities, while the opposite happens with unseen synonyms of train entities. This can be explained by the fact that string-matching approaches have direct access to the KB, making them better suited for zero-shot cases. If training data is available, neural representations are superior instead, as they can leverage representations learned from context.

Scale

Neural model implementations fail to scale to large KBs such as NCBI Gene or dbSNP. In our experiments we resorted to using the NCBI Gene subset determined by the species of the entities found in the gene corpora (see Table 2). This reflects a common real-world use case, since often only a specific subset of species is relevant for linking (e.g. human and mouse). For dbSNP we are not aware of a valid criterion to subset it and were therefore unable to run neural models on the variant corpora.
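The species-based subsetting amounts to filtering KB entries by their foreign entity (see Appendix E). A sketch with illustrative entries follows; the mouse row and its identifier are invented for illustration, while human ("9606") and rat ("10116") taxonomy ids match the examples in the text:

```python
def subset_gene_kb(kb_entries, corpus_species):
    """Restrict an NCBI Gene-style KB to entities whose species foreign key
    appears in the given set of NCBI Taxonomy identifiers."""
    return [e for e in kb_entries if e["foreign_entity"] in corpus_species]

kb = [
    {"entity": "2",     "name": "A2M", "foreign_entity": "9606"},   # human
    {"entity": "24153", "name": "A2m", "foreign_entity": "10116"},  # rat
    {"entity": "11287", "name": "Pzp", "foreign_entity": "10090"},  # mouse (illustrative)
]
subset = subset_gene_kb(kb, corpus_species={"9606", "10116"})
```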

Synchronization of KB version
Corpora are only sparsely affected by changes in entities. However, if these changes are not handled properly, even a perfect system would register an error rate of 4.56% and 3.57% on BC5CDR (C) and Linnaeus (the two most affected corpora in BELB), respectively (see Appendix F).

Discussion
We strived to include in BELB as many corpora and KBs as possible, prioritizing those which are most commonly used by the community. We leave as future work the expansion to other important research directions, such as applications to clinical notes (Luo et al., 2020) and to other languages, such as Spanish (Miranda-Escalada et al., 2022) and German (Kittner et al., 2021).
Our evaluation showed that neural approaches fail to perform consistently across all BELB instances, especially on genes, where RBES approaches are still far superior. However, as reported in Section 2.3.2, our evaluation protocol introduces a bias towards RBES systems by considering exclusively the test mentions they correctly identify. Nonetheless, we believe that this is the best approximation to compare results across all methods. We note as well the lack of hyperparameter exploration for the neural models. Due to the high computational resources necessary, we rely on the defaults reported by the authors. It is therefore likely that optimizing them would result in better numbers. Further improvements may be possible by pre-training on the KB (Liu et al., 2021; Yuan et al., 2022) and refining candidates with a cross-encoder (Agarwal et al., 2022). RBES systems are advantaged by the use of ad-hoc components to handle homonyms. In Table 5 we show that introducing similar approaches for neural models could significantly improve their performance. However, as the neural paradigm is based on learning task-related capabilities from data (Bengio et al., 2013), we believe that future studies should nevertheless continue to investigate entity-agnostic models, rather than falling back on custom-made hand-crafted heuristics.

Conclusion
We presented BELB, a benchmark to standardize experimental setups for biomedical entity linking. We conducted an extensive evaluation of rule-based entity-specific systems and recent neural approaches. We find that the former are still the state of the art on entity types not explored by neural approaches, namely genes and variants. We hope that BELB will encourage future studies to compare approaches on a common testbed and to address the current limitations of neural approaches.

A. Neural biomedical entity linking
Studies in BEL have converged on using primarily three corpora: NCBI Disease (Dogan et al., 2014), BC5CDR (Li et al., 2016) and MedMentions (Mohan and Li, 2019). However, as shown in Table 7, where we report a summary of recent approaches, experimental setups differ substantially in terms of the corpora and KBs used, making comparisons based solely on published numbers problematic. For instance, in BioSyn the BC5CDR corpus is divided into two, distinguishing between chemical and disease entities, and linked to the CTD vocabularies, while GenBioEL reports results on the entire corpus linking to MeSH.

Table 7. Overview of experimental designs adopted by recent neural approaches for biomedical entity linking. We highlight with a color all cells reporting results which can be compared directly, i.e.: same corpus and knowledge base, abbreviation resolution (Abbr. Res.), pre-training (if any) and evaluation protocol. † No distinction between disease and chemical annotations ‡ Reranking model ♢ Ablation study without pre-training preventing direct comparison

Notably, neural approaches rely on different pre-training strategies with different data sources, making the training signal vary significantly across approaches. This ultimately hinders estimating the difference in performance stemming purely from algorithmic differences. For instance, MedWiki outperforms KRISSBERT on MedMentions, but as it is pre-trained on a larger pool of documents, it is unclear whether this is due to differences in pre-training or in model architecture. Similarly, it is not possible to directly estimate the impact of abbreviation resolution on the reported performance. Finally, studies differ in the type of evaluation used, with some models deploying a lenient evaluation (see Section 2.3.1), further hindering direct comparison even when the corpus and the KB are the same.

B. Showcase
In Listing 1 we show that with BELB it is possible to test a simple exact-match approach on all available pairs of corpus and knowledge base in fewer than 30 lines of code.
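For readers without the repository at hand, the gist of such a baseline looks roughly as follows (a self-contained, hypothetical stand-in, not Listing 1 itself; BELB's actual loaders provide the corpus and KB pairs):

```python
def exact_match_linker(kb, mentions):
    """A trivial baseline in the spirit of Listing 1: link each mention to the
    entities whose synonym exactly matches its surface form (case-insensitive).
    `kb` is a list of (synonym, entity_id) pairs."""
    index = {}
    for synonym, entity in kb:
        index.setdefault(synonym.lower(), set()).add(entity)
    return [index.get(m.lower(), set()) for m in mentions]

# Toy KB fragment with NCBI Taxonomy identifiers.
kb = [("Homo sapiens", "9606"), ("human", "9606"), ("Rattus norvegicus", "10116")]
predictions = exact_match_linker(kb, ["Human", "rats"])
```

The second mention illustrates why exact matching is a weak baseline: "rats" has no exact synonym in the toy KB, so the linker returns no candidate.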

C. Corpora and Knowledge Bases
In this section we describe in detail the corpora and knowledge bases contained in BELB, grouped by their entity type.

Gene For genes, in contrast to previous approaches (Tutubalina et al., 2020), we use GNormPlus (Wei et al., 2015) instead of the BioCreative II Gene Normalization (BC2GN) corpus (Morgan et al., 2008). This is because BC2GN (i) is limited to human genes and (ii) provides identifier annotations only at the document level. GNormPlus consists of two corpora re-annotated at the mention level, namely BC2GN and the Gene Indexing Assistant (GIA) test collection. We use the GIA test collection as the development split, since GNormPlus (the corpus) does not provide one. We also include NLM-Gene (Islamaj et al., 2021), which covers ambiguous gene names by including more species.

We opt for specialized vocabularies over UMLS because (i) some corpora were specifically linked to them (e.g. NCBI Disease) and (ii) as UMLS contains many unrelated concepts, it would unnecessarily increase the search space.
In Table 8 we report links to all resources included in BELB along with the license under which they are released. All resources created by NLM, as a governmental agency, are by nature in the public domain. Many corpora do not specify a license (we mark them as N/A) but can be freely accessed and do not state any limitation on data usage (apart from, in some cases, forbidding commercial use). The same considerations hold for the knowledge bases. UMLS is the only resource which is not freely available: users are required to enter into a Data Usage Agreement (DUA).

D. Entity types in MedMentions

In Table 9 we report the ranked number of entity mentions in MedMentions (ST21PV) grouped by their UMLS Semantic Groups. As reported in Section 2.1.1, entity types such as diseases ("Disorders") and species ("Living Beings") are covered only marginally, while genes ("Genes & Molecular Sequences") are completely absent. Notably, a high proportion of mentions (∼47%) is instead devoted to general entities such as "Phenomena" and "Procedures".

E. Unified schema for knowledge bases
In Listing 2 we provide an overview of the unified schema used to store all knowledge bases in BELB. In its basic form, a KB is a list of synonyms (names), each associated with a single entity. All KBs also provide information about each name ("description"). For instance, in NCBI Taxonomy a name can be the "scientific name" of a species ("Homo Sapiens") or the "common" one ("human"). In all KBs exactly one name per entity must be the primary one, i.e. the one most commonly used to refer to the concept represented by the entity. For instance, the primary name (a.k.a. symbol) for entity "2" in NCBI Gene is "A2M". Some of the KBs are interconnected. For instance, "α2microglobulin" represents a different entity when referring to the human or the rat gene. Thus, besides having different identifiers ("2" and "24153", respectively), to ease downstream applications (e.g. data integration), NCBI Gene provides what we call (borrowing from database jargon) a foreign entity. For instance, all entries in NCBI Gene with identifier "2" are accompanied by the foreign entity "9606", i.e. the entity in NCBI Taxonomy denoting "Homo Sapiens" (human). If available, a KB also provides a history table, where changes to the identifiers are tracked, i.e. whether they have been replaced by others or have become obsolete. In Table 10 we report the number of changes in entity labels in the test split of each BELB corpus. For each gold label associated with an entity mention there can be two types of changes: either the label has been replaced (Replaced), in which case it can be updated, or it was removed from the KB (Removed), which makes the label obsolete and the mention not linkable, in which case we exclude the entity mention from the test set.
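The schema described above can be approximated in a few lines of SQLite (an illustrative sketch with invented table and column names, not BELB's actual schema definition from Listing 2): synonyms with a primary-name flag, an optional foreign entity, and a history table tracking replaced/removed identifiers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE kb (
    entity TEXT NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    is_primary INTEGER NOT NULL DEFAULT 0,  -- exactly one primary name per entity
    foreign_entity TEXT                      -- e.g. NCBI Taxonomy id for genes
);
CREATE TABLE history (
    old_entity TEXT NOT NULL,
    new_entity TEXT                          -- NULL if the entity was removed
);
""")
# Entries mirroring the NCBI Gene example from the text.
con.executemany("INSERT INTO kb VALUES (?, ?, ?, ?, ?)", [
    ("2", "A2M", "symbol", 1, "9606"),
    ("2", "alpha-2-macroglobulin", "full name", 0, "9606"),
])
con.execute("INSERT INTO history VALUES ('4197', '2122')")
primary = con.execute(
    "SELECT name FROM kb WHERE entity = '2' AND is_primary = 1"
).fetchone()[0]
```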

G. Biomedical entity linking systems
In Table 11 we report all systems (and the link to their original implementation) taken into consideration in our benchmarking.

G.1. Rule-based entity-specific systems
In Table 12 we provide an overview of the rule-based entity-specific models. Implementation details for each model are reported below.
GNormPlus uses a CRF model to perform entity recognition. The normalization component is a statistical inference network based on TF-IDF frequencies. The system comes with two pre-trained models, namely "GNR.Model", which was trained on the train and development split (as defined by BELB) of GNormPlus, and "GNR.GNormPlusCorpus NLMGeneTrain.Model", which was trained on the whole GNormPlus corpus and the train and development split (as defined by BELB) of the NLM-Gene corpus. We use the first one when evaluating on GNormPlus and the second on NLM-Gene.

SR4GN (Species Recognition for Gene Normalization) is a rule-based system which, as the name suggests, is mainly a support component for gene normalization. It implements several custom rules to address cases where species information is not explicitly available. We run GNormPlus on the species corpora as well, since SR4GN is only available as a GNormPlus component.

tmVar uses pattern matching to both recognize and normalize variants, falling back to dictionary lookup when the former fails. It is distributed with a pre-trained model and a pre-processed version of dbSNP. Its usage depends on GNormPlus, as it requires normalized gene mentions to perform linking.
TaggerOne is a general-purpose joint recognition and linking system based on a semi-Markov model. It provides two models, "model NCBID.bin" and "model BC5CDRD.bin", which are trained on the train split of NCBI Disease and on the disease annotations of BC5CDR, respectively. Wei et al. (2019) report using a TaggerOne model trained on BioID. They however do not publicly release the trained model. Our attempt to use the TaggerOne training code resulted in an error.
BC7T2W is a hybrid model based on dictionary-lookup and BioBERT embeddings (Lee et al., 2020).By default the systems

Table 1 .
Overview of the corpora available in BELB with their primary characteristics: number of documents, annotations and how many of them are zero-shot (unseen entities) or stratified (seen entity but unseen name). † Full text ‡ Figure captions

Table 5 .
Relative improvement of neural models with resolved abbreviations and a lenient evaluation in case of multiple predicted entities. OOM: out-of-memory (>200GB)

Table 9 .
Ranked number of entity mentions (and relative amount) in MedMentions (ST21PV) grouped by UMLS Semantic Groups.

Table 10 .
Overview of the number of changes in entity labels in the test split of each corpus in BELB.
Listing 2 Example SQL code defining the BELB schema for all knowledge bases.

Table 11 .
Overview of the biomedical entity linking systems benchmarked on BELB.

Table 12 .
Overview of the rule-based entity-specific baselines benchmarked on BELB.