Critical assessment of transformer-based AI models for German clinical notes

Abstract

Objective: Healthcare data such as clinical notes are primarily recorded in an unstructured manner. If adequately translated into structured data, they can be utilized for health economics and set the groundwork for better individualized patient care. To structure clinical notes, deep-learning methods, particularly transformer-based models like Bidirectional Encoder Representations from Transformers (BERT), have recently received much attention. Currently, biomedical applications are primarily focused on the English language. While general-purpose German-language models such as GermanBERT and GottBERT have been published, adaptations for biomedical data are unavailable. This study evaluated the suitability of existing and novel transformer-based models for the German biomedical and clinical domain.

Materials and Methods: We used 8 transformer-based models and pre-trained 3 new models on a newly generated biomedical corpus, and systematically compared them with each other. We annotated a new dataset of clinical notes and used it with 4 other corpora (BRONCO150, CLEF eHealth 2019 Task 1, GGPONC, and JSynCC) to perform named entity recognition (NER) and document classification tasks.

Results: General-purpose language models can be used effectively for biomedical and clinical natural language processing (NLP) tasks; still, our newly trained BioGottBERT model outperformed GottBERT on both clinical NER tasks. However, training new biomedical models from scratch proved ineffective.

Discussion: The domain-adaptation strategy's potential is currently limited due to a lack of pre-training data. Since general-purpose language models are only marginally inferior to domain-specific models, both options are suitable for developing German-language biomedical applications.

Conclusion: General-purpose language models perform remarkably well on biomedical and clinical NLP tasks. If larger corpora become available in the future, domain-adapting these models may improve performances.


INTRODUCTION
In many countries, a considerable portion of clinical routine information is still not gathered in a structured format. While structured data are commonly utilized for health economics and registries, they often lack specific information, such as descriptions of adverse drug events, disease severity, family history, or behavioral and environmental health determinants. Such information is predominantly documented in clinical free-text form, which makes up to 40% of the data generated in current hospital systems.1 The great potential of information documented in narrative text to support translational research and the implementation of clinical applications was recognized early,2-4 but exploiting that potential still poses a challenge. Extracting clinical information through natural language processing (NLP) methods could structure that information to support downstream clinical applications such as deep phenotyping, better individualized clinical decision-making, and automated coding for health economic purposes.
Nowadays, the development of NLP systems for information extraction in English is already quite advanced. Systems such as MedLEE,2,5 MetaMap,6 cTAKES,7 and CLAMP8 have been developed and deployed in the past to extract information from clinical narrative texts. Furthermore, open competitions such as Informatics for Integrating Biology and the Bedside (i2b2),9 National NLP Clinical Challenges (n2c2),10,11 and CLEF eHealth12 encourage sharing of data and models and are further driving developments in this area. The systems developed so far include rule-based, machine-learning-based, and hybrid models. While rule-based approaches were indispensable in the early stages, today's research often focuses on machine-learning methods. In particular, deep-learning networks, such as recurrent neural networks (RNNs) or convolutional neural networks, have been used extensively in recent years13 as they can achieve higher performances if sufficient amounts of training data exist. Compared to traditional machine-learning methods, deep neural networks usually employ methods such as Word2Vec,14,15 GloVe,16 or FastText17 to represent words as vectors. These methods model language by learning relationships between words, so-called word embeddings, from a large textual corpus. Using the word embeddings as features replaces the manual feature engineering required by traditional methods. Following the idea of word vector representation, research continued and led to the development of another group of deep neural networks: transformer-based models. The Transformer, published by Vaswani et al. in 2017,18 was initially designed for neural machine translation and addressed two shortcomings of RNNs: missing parallelization and long-range dependencies. It relies heavily on the self-attention mechanism, which weighs each part of the input differentially. Since it works without recurrence, it is parallelizable and computationally more efficient than its RNN counterparts.
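The core idea of word embeddings, that words appearing in similar contexts receive similar vectors, can be illustrated with a toy example. The 3-dimensional vectors and German words below are invented for illustration only; real Word2Vec or GloVe embeddings have hundreds of dimensions learned from a corpus:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors; values close to 1.0
    indicate that the words occur in similar contexts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings: two drug names and an unrelated word.
embeddings = {
    "Aspirin":   [0.91, 0.10, 0.05],
    "Ibuprofen": [0.88, 0.15, 0.02],
    "Hausarzt":  [0.05, 0.90, 0.30],
}

sim_drugs = cosine_similarity(embeddings["Aspirin"], embeddings["Ibuprofen"])
sim_mixed = cosine_similarity(embeddings["Aspirin"], embeddings["Hausarzt"])
```

In a well-trained embedding space, `sim_drugs` would exceed `sim_mixed`, which is exactly the property that lets embeddings replace manual feature engineering.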
In 2019, Devlin et al. used parts of the original architecture to develop Bidirectional Encoder Representations from Transformers (BERT) and achieved state-of-the-art results in numerous NLP tasks.19 As with other transformer-based models, it is trained in 2 stages: First, it is pre-trained using large amounts of unlabeled data by applying novel training objectives such as masked language modeling (MLM) and next-sentence prediction. In the second stage, the model is fine-tuned for specific NLP tasks with labeled data. Since the publication of BERT, numerous variants of the model have been presented. While approaches such as RoBERTa20 and ELECTRA21 tackled potential limitations and shortcomings of the model architecture and training procedure, other variants such as BioBERT22 and ClinicalBERT23,24 were developed to achieve domain specificity.
In the German-speaking world, developments lag far behind and are often driven only by commercial software or local applications.25 Strict data protection laws hinder data sharing, and thus clinics typically only allow for the use of data internally. These factors inhibit the sharing of datasets and models, as well as the hosting of open challenges with German datasets.25,26 Nevertheless, there have been promising approaches in recent years: With JSynCC27 and GGPONC,28 2 datasets have been published that contain texts with biomedical language but are not affected by data protection issues. Recently, the first corpus containing de-identified discharge letters, called BRONCO150,29 was published. Furthermore, the CLEF eHealth challenge provided a dataset of non-technical summaries of animal studies in 2019. Sänger et al. used the multilingual BERT version (mBERT) to classify these summaries and showed that mBERT significantly outperformed a baseline Support Vector Machine model.30 Later, Bressem et al. trained domain-specific BERT models using 3.8 million radiographic reports and evaluated them in a classification task with promising results. Similarly, Richter-Pechanski et al. pre-trained BERT models on 200 000 discharge letters and fine-tuned them for a clinical concept extraction task. General-purpose language models (GPLMs) have already performed excellently in all of these cases. However, none of these studies systematically compared already published models such as GottBERT or GELECTRA, but rather focused on mBERT or GermanBERT. Furthermore, none of the pre-trained clinical models are publicly available yet.
In our work, we developed 3 new biomedical domain-specific language models and evaluated their performance on 5 clinical NLP tasks in comparison to 8 GPLMs. For this purpose, we first assembled a dataset of unlabeled biomedical texts and trained our models. We then annotated clinical entities in 50 discharge letters to generate a new dataset called ChaDL (Charité Discharge Letters), which we used with BRONCO150, the CLEF eHealth dataset from 2019, GGPONC, and JSynCC to fine-tune and evaluate models. To our knowledge, this is the first comprehensive comparison of German-language transformer models for clinical NLP applications.

General overview
The work described in this article consisted of 3 phases (Figure 1):
1. Annotation of ChaDL: We manually annotated 50 de-identified discharge letters from the Charité – Universitätsmedizin Berlin with respect to the entities diagnosis, disorder, dosage, intake, medication, and procedure.
2. Pre-training: Subsequently, we pre-trained several transformer models on German-language scientific abstracts, drug leaflets, and medicine-related Wikipedia articles.
3. Fine-tuning: Finally, we performed fine-tuning and evaluation of 11 models for named entity recognition (NER) and document classification based on 5 corpora, including ChaDL.
In the following, we describe our approach in more detail.

Datasets
We used 6 different datasets for pre-training and fine-tuning of transformer models. The corpus we compiled for pre-training consisted of German medical articles from Wikipedia, drug leaflets from the AMIce database (https://www.dimdi.de/dynamic/de/arzneimittel/arzneimittel-recherchieren/amis/), and scientific abstracts from the LIVIVO search engine.33 For the latter, we only used abstracts from databases with biological or medical relevance. All elements such as lists, tables, and equations that can confuse text mining systems were removed from the documents. For fine-tuning, we used 4 published datasets, BRONCO150,29 the CLEF eHealth 2019 dataset, GGPONC,28 and JSynCC,27 as well as a newly created dataset of clinical discharge letters called ChaDL that originated from Charité – Universitätsmedizin Berlin.
JSynCC is the first publicly available dataset with documents in the German clinical language. It contains 867 documents extracted from 10 medical textbooks (see Table 1). Since each document is assigned to one or more specialized medical fields, this dataset is suited for a multi-label document classification task. Nonetheless, the class distribution is highly imbalanced, and most labels are only represented a few times (see Supplementary Figure A.5). Since a model can neither be adequately trained nor evaluated if classes are this scarcely represented, we generated 2 subsets of JSynCC in which we excluded classes whose frequency falls below a specified threshold. Version A represents the extreme case in which only a few samples are available to train a model: We kept all document labels that occurred at least 5 times, thereby reducing the number of documents from 867 to 849. For version B, which is closer to a real-world scenario with more samples available for training, we limited labels to those that occur at least 50 times, thereby reducing the total documents from 867 to 494. The main article shows the results of our experiments with version B. A detailed description of version A and the respective results are available in Supplementary Appendix C.2.
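The subset construction described above can be sketched as follows. This is a simplified illustration with invented labels, not the actual JSynCC preprocessing code:

```python
from collections import Counter

def filter_by_label_frequency(documents, min_count):
    """Drop labels occurring fewer than min_count times across the corpus,
    then drop documents left without any label (analogous to how JSynCC
    versions A and B were derived with thresholds 5 and 50)."""
    counts = Counter(label for _, labels in documents for label in labels)
    kept_labels = {l for l, c in counts.items() if c >= min_count}
    filtered = []
    for text, labels in documents:
        labels = [l for l in labels if l in kept_labels]
        if labels:  # discard documents whose labels were all removed
            filtered.append((text, labels))
    return filtered, kept_labels

# Hypothetical mini-corpus of (text, labels) pairs:
docs = [
    ("doc1", ["surgery", "orthopedics"]),
    ("doc2", ["surgery"]),
    ("doc3", ["dermatology"]),  # rare label, removed at min_count=2
]
filtered, kept = filter_by_label_frequency(docs, min_count=2)
```

With `min_count=2`, only the "surgery" label survives, and "doc3" is dropped entirely, mirroring the reduction from 867 to 849 or 494 documents.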
As part of the 2019 CLEF eHealth challenge, a dataset comprising 8793 German non-technical summaries (NTPs) of animal experiments was made available. The documents have been manually annotated by experts; each has received zero or more ICD-10 codes as document-level labels. Like JSynCC, we used it for a multi-label document classification task.
GGPONC contains 8414 text segments that have been extracted from 25 oncology clinical practice guidelines and hence is one of the largest corpora of German medical texts. Borchert et al. automatically annotated the corpus with 7 UMLS terms and screened for TNM expressions and gene names. Afterward, 4 annotators manually curated a subset of 4153 text segments to generate a gold standard. In this study, we used only the 4153 manually curated text segments for our experiments.
As the first freely available corpus of de-identified clinical notes, the recently published Berlin-Tübingen Oncology corpus (BRONCO150) contains shuffled sentences from 150 German oncological discharge summaries. Nine annotators (medical experts and students) annotated the documents using the labels diagnosis, treatment, and medication, as well as other attributes.
Our newly created dataset ChaDL consists of 50 de-identified discharge letters from the neurological department of the Charité – Universitätsmedizin Berlin, collected as part of studies in which informed consent was given to extract data from the hospital information system. These discharge letters contain various sections, of which we focused on anamnesis, diagnoses, medication, and epicrisis. We used the annotation tool INCEpTION35 to manually annotate the mentions of the diagnosis, disorder, dosage, intake, medication, and procedure entity classes (see Supplementary Material Section A.2 for details of the annotation process). These entity classes were chosen to capture detailed information about patients' examination, health condition, and treatment. The majority of the discharge letters were annotated by only 1 annotator; however, 20% were annotated by a second expert to determine the quality of manual annotation by calculating the inter-annotator agreement score Krippendorff's alpha. On average, we achieved a score of 0.76 ± 0.11, indicating a relatively high agreement between the 2 annotators.

Published transformer models
We focused our experiments on the 3 transformer-based model architectures BERT, ELECTRA, and RoBERTa.
BERT19 is a bidirectional transformer-based encoder model, which is pre-trained on large amounts of unlabeled data using MLM and next sentence prediction (NSP) jointly as training objectives. During MLM, some input tokens are randomly masked and the objective is to predict the original tokens based only on their context. The NSP task is to determine whether 2 sentences are consecutive or not.
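The MLM objective can be illustrated with a minimal sketch of BERT's masking scheme, where roughly 15% of tokens become prediction targets and, of those, 80% are replaced by a mask token, 10% by a random token, and 10% are left unchanged. This is illustrative only; real tokenizers operate on subword tokens over a full vocabulary:

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def mask_tokens(tokens, rng):
    """BERT-style MLM corruption: returns the corrupted sequence and,
    per position, the original token to predict (or None if the
    position is not a prediction target)."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            labels.append(tok)                     # model must recover this
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)                # 80%: mask token
            elif r < 0.9:
                masked.append(rng.choice(tokens))  # 10%: random token
            else:
                masked.append(tok)                 # 10%: kept, still predicted
        else:
            labels.append(None)                    # not a target
            masked.append(tok)
    return masked, labels

rng = random.Random(0)
tokens = "der Patient erhielt 40 mg Pantoprazol täglich".split()
masked, labels = mask_tokens(tokens, rng)
```

During pre-training, the model sees `masked` as input and is penalized only at positions where `labels` is not None.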
RoBERTa 20 is an optimized version of BERT. It is built on the same architecture as BERT but abandons the NSP objective and only uses masked-language modeling for pre-training. Unlike BERT, however, the data are not masked statically during preprocessing but dynamically during each epoch. In addition, some hyperparameters such as the batch size and the tokenizer have been changed.
ELECTRA21 uses the same architecture as BERT but differs in its pre-training procedure. While BERT aims for MLM and NSP, ELECTRA uses a method called replaced token detection (RTD). Two separate models are used for this purpose: a generator and a discriminator. The generator is trained by MLM, and its output is then used as input for the discriminator. The discriminator has to predict whether a token has been replaced or whether it is the original input. After pre-training, only the discriminator is used.

Figure 1. Study overview. First, a set of 50 discharge letters was annotated with medical entities. Second, biomedical transformer models were pre-trained on a newly assembled biomedical corpus, either by training from scratch or through domain adaptation of an existing model. Third, the pre-trained models were compared to 8 published models on 5 fine-tuning tasks.

Table 1 note: Classes that are not present in one of the datasets are denoted with "-". For the CLEF eHealth 2019 dataset, we only report the number of documents, sentences, and tokens, as more than 200 possible labels exist. a: Number of sections which were extracted from the discharge letters.

Table 2 lists the models we used in this study and provides information on the data used for pre-training. All German language models were trained on general language corpora consisting of Wikipedia articles, books, news articles, or vast amounts of crawled textual data.
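The RTD objective described above reduces, at the label level, to comparing the generator's corrupted output with the original sequence. The following toy sketch derives the discriminator's per-token targets; in ELECTRA both models are neural networks trained jointly, which this illustration does not attempt to reproduce:

```python
def rtd_labels(original, corrupted):
    """Per-token replaced-token-detection targets: the discriminator must
    output 'replaced' wherever the generator changed a token, and
    'original' everywhere else."""
    assert len(original) == len(corrupted)
    return ["replaced" if o != c else "original"
            for o, c in zip(original, corrupted)]

original = ["Patient", "erhielt", "Pantoprazol", "40", "mg"]
# Suppose the generator filled two masked positions with plausible
# but wrong tokens (hypothetical example):
corrupted = ["Patient", "bekam", "Pantoprazol", "40", "ml"]

labels = rtd_labels(original, corrupted)
# -> ["original", "replaced", "original", "original", "replaced"]
```

Because every position carries a binary target, RTD provides a denser training signal than MLM, where only the masked ~15% of positions contribute to the loss.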
Training and assessment of language models for the German clinical domain

Pre-training
We followed 2 strategies to pre-train transformer models specific to the German biomedical domain. First, we used an existing RoBERTa-based model, named GottBERT, for domain adaptation, and second, we trained 2 newly initialized ELECTRA-based models from scratch.
For the domain-adapted GottBERT model, we loaded the pre-trained model and trained it on our biomedical corpus with static masked-language modeling and linear learning rate scheduling. A detailed list of the hyperparameters used can be found in the Supplementary Material. For the ELECTRA models, we used our biomedical corpus to generate a new vocabulary for the WordPiece tokenizer.43 Then, we initialized 2 new ELECTRA models in the small and base configurations and subsequently trained both with the hyperparameters specified in Supplementary Table B.1. We refer to these 2 models as BioELECTRA-small and BioELECTRA-base, respectively.

Performance assessment
We assessed the performance of the 8 published and the 3 newly pre-trained transformer-based models on 2 types of downstream tasks, document classification and NER.
The JSynCC and the CLEF eHealth datasets (see Table 1) were used to evaluate the models for multi-label document classification tasks. For the transformer-based models, the documents were split into one or more sequences of 512 tokens. If multiple instances existed per document, max-pooling was applied to the logits before loss calculation and final classification.
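The splitting-and-pooling step can be sketched as follows. This is a simplification: real inputs are subword token IDs and the logits come from the fine-tuned model; here, plain lists of floats stand in for per-chunk logits over 3 hypothetical labels:

```python
def chunk(tokens, max_len=512):
    """Split a long document into consecutive windows of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def max_pool_logits(per_chunk_logits):
    """Element-wise maximum over the chunks' logits, so a label counts as
    present if any chunk of the document activates it."""
    return [max(values) for values in zip(*per_chunk_logits)]

doc = [f"tok{i}" for i in range(1100)]  # a document of 1100 tokens
chunks = chunk(doc)                     # -> 3 chunks of 512, 512, and 76 tokens

# Hypothetical logits for 3 labels from each of the 3 chunks:
logits = [[-1.2, 0.4, -0.3],
          [ 2.1, -0.7, -0.2],
          [-0.5, -0.9, 1.8]]
pooled = max_pool_logits(logits)        # [2.1, 0.4, 1.8]
```

Max-pooling before the loss lets gradient signal flow to whichever chunk contains the evidence for each label, which matters for documents longer than the model's 512-token limit.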
For the NER task, BRONCO150 (we used the same 5 outer folds as the authors to evaluate the model performances), GGPONC, and ChaDL (see Table 1) were used for the performance assessment. The data were prepared according to the BILOU tagging scheme, and the performance was assessed at the entity level.
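Converting annotated entity spans into BILOU tags can be sketched as follows. The tokens and spans are hypothetical; the actual preprocessing operates on each corpus's annotation format:

```python
def spans_to_bilou(tokens, spans):
    """spans: list of (start, end, label) with exclusive end, in token indices.
    Returns one BILOU tag per token: B-egin, I-nside, L-ast of a multi-token
    entity, O-utside any entity, or U-nit for a single-token entity."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags

tokens = ["Gabe", "von", "40", "mg", "Pantoprazol"]
spans = [(2, 4, "Dosage"), (4, 5, "Medication")]
tags = spans_to_bilou(tokens, spans)
# -> ["O", "O", "B-Dosage", "L-Dosage", "U-Medication"]
```

Entity-level evaluation then treats a prediction as correct only if the whole tagged span and its label match the gold annotation, which is stricter than token-level accuracy.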
In all fine-tuning studies, we fine-tuned the transformer-based models and compared their performances to a baseline. In the case of the CLEF dataset, we compared performance to the best result the challenge organizers provided. In all other experiments, we trained a bidirectional LSTM network with a Conditional Random Field (Bi-LSTM-CRF). When we trained the models for the CLEF dataset, we used the train, validation, and test splits from the original tasks. In all other cases, we performed 5-fold nested cross-validation to assess the performance of the models. We used the Optuna hyperparameter optimization framework44 to optimize hyperparameters such as the batch size, learning rate, and weight decay (see Supplementary Table B.2 for details) by maximizing the micro F1-score. We trained for a maximum of 50 (BRONCO150, ChaDL, GGPONC, JSynCC) or 80 (CLEF eHealth 2019) epochs but used an early stopping procedure to stop after 15 epochs if performance did not improve (ΔF1 < 0.01); the best model was used for evaluation in the end.
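The early-stopping criterion can be sketched as a minimal loop over validation scores. This sketch omits the Optuna trials and model checkpointing of the real training procedure; the score sequence is invented:

```python
def train_with_early_stopping(f1_per_epoch, max_epochs=50, patience=15,
                              min_delta=0.01):
    """Stop once the validation micro-F1 has not improved by at least
    min_delta for `patience` consecutive epochs; return the best epoch
    and its score (the checkpoint that would be used for evaluation)."""
    best_f1, best_epoch, stale = 0.0, 0, 0
    for epoch, f1 in enumerate(f1_per_epoch[:max_epochs], start=1):
        if f1 - best_f1 >= min_delta:
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # ΔF1 < 0.01 for `patience` epochs: stop training
    return best_epoch, best_f1

# Hypothetical validation scores that plateau after epoch 4:
scores = [0.40, 0.55, 0.62, 0.64] + [0.641] * 30
best_epoch, best_f1 = train_with_early_stopping(scores)
# -> best_epoch == 4, best_f1 == 0.64
```

Requiring a minimum improvement (min_delta) rather than any improvement prevents training from being prolonged by noise-level fluctuations in the validation score.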

Implementation
The tokenizers and transformers libraries developed by the Hugging Face team were used for pre-training and fine-tuning experiments of the transformer-based models. For the training of the Bi-LSTM-CRF model, we used the flair framework with GloVe and flair embeddings.45-47 For pre-training, we utilized up to 4 NVIDIA V100 or A100 GPUs. In all other cases, single NVIDIA V100 or A100 GPUs were used.
We used several libraries to calculate metrics: The kAlpha (https://github.com/emerging-welfare/kAlpha, accessed on November 24, 2021) implementation was used to calculate Krippendorff's Alpha for the inter-annotator agreement. The metrics for the multi-label document classification tasks were calculated with the classification_report function from scikit-learn (version 0.23.2),48 and the metrics for the NER tasks were calculated with the classification_report function from the seqeval library (version 1.2.2).49
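Entity-level evaluation, as performed by seqeval's classification_report, credits a prediction only when both the full span and its label match the gold annotation. The simplified sketch below computes the micro-averaged entity-level F1 directly (it handles B/L/U tags only and is not the seqeval implementation):

```python
def extract_entities(tags):
    """Collect (start, end, label) spans from a BILOU tag sequence
    (simplified: I- tags inside a span are not validated)."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(tags):
        if tag.startswith("U-"):
            entities.add((i, i + 1, tag[2:]))
        elif tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("L-") and label == tag[2:]:
            entities.add((start, i + 1, label))
            start, label = None, None
    return entities

def micro_f1(true_tags, pred_tags):
    """Micro-averaged entity-level F1: exact span and label must match."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = ["O", "B-Diagnosis", "L-Diagnosis", "U-Medication"]
pred = ["O", "B-Diagnosis", "L-Diagnosis", "O"]
score = micro_f1(gold, pred)  # 1 of 2 gold entities found, no false positives
```

Here precision is 1.0 and recall 0.5, giving an F1 of 2/3, illustrating why missing even one entity in a short document moves the entity-level score substantially.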

RESULTS
In this study, we show the assessment results of general-purpose and domain-specific language models for the German clinical domain. We begin by presenting the pre-training results of the 3 models. Then, we highlight the fine-tuning performance of these 3 newly pre-trained and 8 already-published models on the 5 fine-tuning tasks.

Figure 2 shows the pre-training metrics of the 3 new models. In the case of BioGottBERT, where we followed a transfer-learning approach and initialized it with the GottBERT parameters, the MLM accuracy increased from 75.0% to 82.0%. Unfortunately, a direct comparison of the BioGottBERT metrics and those of the BioELECTRA-small and BioELECTRA-base models is problematic since different training objectives were followed. For these 2 models, there are 2 measures, namely MLM and RTD accuracy. In both cases, the generators' MLM accuracy starts at 0% and moves, after an initial sharp increase, to 54% and 70% for the small and base models, respectively. On the other hand, the discriminators' RTD accuracy starts at close to 100% and deteriorates to 39% for the base model, whereas in the small model, it ends at 99%. A subsequent examination of the training's environmental impact revealed that the training of BioELECTRA-small and BioGottBERT required comparable amounts of energy; however, BioELECTRA-base required approximately 4 times more (see Supplementary Appendix B.2).

Table 3 depicts the results of the document classification tasks on the CLEF eHealth 2019 and JSynCC datasets (see Supplementary Table C.4 for the results of subset version A). For JSynCC, all models, including the Bi-LSTM-CRF model, achieved very high F1-scores ranging from 89.0% to 92.7%. The greatest F1-scores were obtained by GBERT, mBERT, and GermanBERT, with no significant difference between them. When applied to the CLEF eHealth dataset, the differences between the results increased substantially.
Our fine-tuned variant was slightly inferior to Sänger et al.'s mBERT model (ΔF1 = −1.2); however, GottBERT and GBERT reached comparable results. In both cases, our pre-trained BioELECTRA and BioGottBERT models were outperformed by the top-performing GBERT model.

Fine-tuning performance
The results of the 3 NER tasks, in which various medical entities are detected in the BRONCO150, ChaDL, and GGPONC corpora, are summarized in Table 4. In contrast to the GGPONC dataset, the model performances vary considerably on the BRONCO150 and ChaDL datasets.
For the BRONCO150 dataset, F1-scores between 46.7% and 83.2% were observed. The BioELECTRA-small, BioELECTRA-base, and mBERT models achieved the lowest performances, with gaps of 36.5, 19.1, and 20.7 percentage points to the best model, respectively. All other models showed more similar performances.

For the ChaDL dataset, we observed diverse performances. The 2 BioELECTRA models performed poorly, as on the BRONCO150 dataset (61.1% and 55.3% for the small and base model, respectively). Similarly, the ClinicalBERT model, which was fine-tuned using a translated version of the ChaDL corpus (see Supplementary Material Section A.2.3 for details of the translation process), reached a low score of 44.4%. The F1-scores of the remaining models ranged between 61.4% and 80.4%, and as before, BioGottBERT scored best. The top-performing models, BioGottBERT, GottBERT, and GELECTRA, outperformed our Bi-LSTM-CRF model.
The results obtained on the GGPONC dataset lie in a more similar range for most models (79.4-83.9%, excluding the BioELECTRA models).

Given all results, we conclude that not all transformer-based models are equally suited for biomedical and clinical applications. For the document classification tasks, we identified GBERT as the best-performing model. Our pre-trained BioGottBERT, the published GottBERT, and GELECTRA models were the best-performing models for the NER tasks. In contrast to the BioGottBERT model, the newly trained BioELECTRA models proved ineffective. Except for the JSynCC dataset, the base model performed significantly worse than most other models. The small model performed well on CLEF, JSynCC, and GGPONC but was inferior on the 2 clinical datasets, BRONCO150 and ChaDL.

DISCUSSION
Clinical notes represent a vital resource for communication between medical experts. As information hidden in clinical notes has a high potential to support medical research and clinical applications, the accurate extraction and structuring of such patient information are essential. For this purpose, novel systems are needed that are specifically designed for the clinical domain. This study addressed the applicability of publicly available transformer-based language models to the German clinical language domain. Furthermore, we developed new biomedical models by pre-training them on a large biomedical corpus, and we systematically assessed their performances compared to 8 further GPLMs.
One contribution of this study is the development of 3 new transformer-based language models, which we trained on a newly compiled corpus of biomedical text. As described in the Results section, the domain-adapted BioGottBERT achieved, in agreement with our expectations, a higher MLM accuracy than the initial GottBERT model, implying a better understanding of biomedical language. On the other hand, the pre-training of the 2 BioELECTRA models displayed unexpected behavior. As described previously, the base model achieved a higher MLM accuracy than the small model. In contrast, the final RTD accuracy of the base model was much lower than the small model's, implying that the base model's generator predicted masked tokens more accurately, complicating the discriminator's task of differentiating original and replaced tokens. Meanwhile, the lower performance of the small model's generator made the discriminator's job easier.
Furthermore, we created ChaDL, a new clinical dataset for NER. We annotated 50 discharge letters with medical terms and achieved satisfactory quality according to the calculated inter-annotator agreement score. In addition, we utilized the BRONCO150, CLEF eHealth 2019, GGPONC, and JSynCC datasets. Although the nature of the datasets varies, it is helpful to use all of them in order to evaluate a broad range of biomedical language understanding. By using clinical and biomedical datasets, we followed the example of the English benchmark Biomedical Language Understanding Evaluation (BLUE).50 While the GGPONC, JSynCC, and CLEF eHealth 2019 datasets are based on clinical guidelines, fictional text, or NTPs, BRONCO150 and ChaDL are based on discharge letters and, therefore, are more important for assessing the performance for clinical applications. While BRONCO150 contains more discharge letters (150 vs 50), ChaDL benefits from the integrity of the entire documents rather than single, randomly mixed sentences. Therefore, we believe that ChaDL reflects real-world clinical applications more accurately than the other datasets.
The final contribution is the systematic comparison of all mentioned models. The fine-tuning results for the 5 datasets indicated positive effects of domain adaptation. BioGottBERT outperformed GottBERT on BRONCO150 and ChaDL while being only marginally inferior on the GGPONC dataset. However, the pre-training from scratch showed no positive effects for the 2 BioELECTRA models, which were strongly outperformed by all other models on the 2 clinical datasets, BRONCO150 and ChaDL. The domain adaptation's lower environmental impact (see Supplementary Appendix B.2) provides further support for this strategy.
The overall results of this study align well with previous studies. On the one hand, it has been shown by Bressem et al.31 and Richter-Pechanski et al.32 that training from scratch has, so far, led to lower performances compared to GPLMs and is, therefore, not advantageous. On the other hand, it has been shown that domain-adapted models can have improved performance compared to the initial model.22,24,31 For instance, RadBERT achieved on average a 2% higher AUC than the initial GermanBERT model on the classification of chest radiograph reports, and in the English domain, using BioBERT instead of BERT on the NCBI disease dataset increased the F1-score by 1.1%.
We believe that the low performance of newly trained models is mainly due to the relatively small size of available pre-training corpora. Compared to GermanBERT, we only had about 6.7% of the data used for pre-training, and in the case of GottBERT, it was only 0.5%. Training models from scratch proved unsuccessful with such a limited amount of data. Nevertheless, we see a need to compile a larger German biomedical corpus in the near future so that the limits of German biomedical NLP models can be pushed further using domain-adaptation strategies.
Aside from the encouraging results for the domain-adaptation strategy, our study also confirms that GPLMs perform surprisingly well on clinical NLP tasks. In particular, GBERT achieved excellent results for the document classification tasks, while GottBERT and GELECTRA excelled for the NER tasks. Although domain-specific models will most likely outperform general-purpose language models once larger corpora of biomedical texts are available, these models seem well suited as a first approach for conducting research when domain-specific models are unavailable. Furthermore, we found that the best transformer-based models outperformed Bi-LSTM-CRF models when applied to BRONCO150, ChaDL, and GGPONC, which demonstrates the potential of these models for the development of biomedical NLP applications.
To completely comprehend a model's capability for clinical applications, we suggest conducting additional research to evaluate German language models on relation extraction, question answering, and named-entity normalization tasks. In this regard, it would be ideal to further gather a diverse set of publicly available datasets for a German analog of the BLUE benchmark,50 allowing direct comparison of future models.

Limitations
While conducting our work, we faced 2 main limitations: First, the amount of pre-training data we could acquire was small. Access to German clinical documents for scientists is often severely restricted if the studies are not carried out at a hospital. Similarly, biomedical data are not as abundant as in the English language. Focusing on drug leaflets, Wikipedia, and scientific abstracts, we only retrieved 0.8 GB of textual data, which hindered the pre-training of a transformer-based model for the biomedical domain.
Second, German biomedical and clinical datasets are rare, and there is no standardized benchmark for performance assessment. As already reported in prior studies, 51 there are large differences between English and non-English resources. While no datasets were available a couple of years ago, we now have access to 4 public datasets, BRONCO150, the CLEF eHealth dataset, GGPONC, and JSynCC. In this study, we used them alongside our dataset ChaDL for the evaluation. While all of them are suited for a performance comparison of several models, some are still subject to restrictions. JSynCC suffers from class imbalance, BRONCO150 contains some very short training samples due to the fragmentation into shuffled sentences, and ChaDL consists of relatively few clinical documents.

CONCLUSION
In this study, we investigated the performance of both general-purpose and newly trained domain-specific transformer-based models for the German-language biomedical domain. On the one hand, our findings indicate that training new models from scratch with a small amount of biomedical data is currently ineffective and results in models that are inferior to existing models. On the other hand, we observed that previously published general-purpose models performed remarkably well on the biomedical named-entity recognition and document classification tasks. We were able to slightly enhance performances by domain-adapting an existing model, showing that the domain-adaptation strategy has potential. If larger corpora for the biomedical domain were to become accessible in the future, the boundaries of German biomedical NLP models may be pushed even further by domain adaptation.