An open natural language processing (NLP) framework for EHR-based clinical research: a case demonstration using the National COVID Cohort Collaborative (N3C)

Abstract Despite recent methodological advancements in clinical natural language processing (NLP), the adoption of clinical NLP models within the translational research community remains hindered by process heterogeneity and human factor variations. These factors also dramatically increase the difficulty of developing NLP models in multi-site settings, which is necessary for algorithm robustness and generalizability. Here, we report on our experience developing an NLP solution for Coronavirus Disease 2019 (COVID-19) sign and symptom extraction in an open NLP framework from a subset of sites participating in the National COVID Cohort Collaborative (N3C). We then empirically highlight the benefits of multi-site data for both symbolic and statistical methods, as well as the need for federated annotation and evaluation to resolve several pitfalls encountered in the course of these efforts.


Introduction
Over the past decade, Electronic Health Record (EHR) systems have been increasingly implemented at US healthcare institutions. Large amounts of detailed longitudinal patient information, including lab tests, medications, disease status, and treatment outcomes, have consequently been accumulated and made electronically available. These large clinical databases are valuable data sources for clinical and translational research. As a result, major initiatives have been established to exploit this crucial resource, including the Clinical and Translational Science Awards (CTSA) Program's National Center for Data to Health (CD2H)/National COVID Cohort Collaborative (N3C) 1,2, the Electronic Medical Records and Genomics (eMERGE) Network 3, the Patient-Centered Outcomes Research Institute's (PCORI) Clinical Research Networks (CRNs) 4, the NIH All of Us Research Program 5, and the Observational Health Data Sciences and Informatics (OHDSI) consortium, with demonstrated successes 6,7,8,9.
One common challenge faced by these initiatives is, however, the prevalence of clinical information embedded in unstructured text 10. Compared to structured data entry, text is a more conventional way in the healthcare environment to document impressions, clinical findings, assessments, and care plans. Even with the advent of sophisticated EHR systems, studies have shown that capturing health information fully in structured format through data entry is unlikely to happen; instead, a blended model prevails in which physicians use templates when and where possible and dictate the details of a patient visit in text 11.
Natural language processing (NLP) has been promoted as having great potential to extract information from text 12. NLP algorithms can generally be categorized as using either symbolic or statistical methods 13. Since the turn of the century, machine learning algorithms (i.e., statistical NLP) have gained increased prominence in clinical NLP research 14. Nevertheless, a substantial portion of clinical NLP use cases leverage symbolic techniques, given that dictionary- or rule-based methodologies suffice to meet the information needs of many clinical applications under specific use cases. In the context of EHR-based clinical research, NLP has been leveraged to assist information extraction and knowledge conversion at different stages of research, including feasibility assessment, eligibility criteria screening, data element extraction, and text data analytics. As a result, an increasing number of clinical research studies benefit from state-of-the-art NLP solutions, with reported applications ranging from disease study areas 15,16,17,18 to drug-related studies 19,20. A majority of existing clinical NLP studies are, however, done within a mono-institutional environment 13, which may suffer from limited external validity and research inclusiveness. Compared with single-site research, multi-site research potentially offers larger sample sizes, more adequate representation of participant demographics (e.g., age, gender, race, ethnicity, and socioeconomic status), and more diverse investigator expertise, which may ultimately yield a higher level of research evidence 21,22,23,24.
Despite a plethora of recent advances in adopting NLP for clinical research, barriers remain to the adoption of NLP solutions in clinical and translational research, especially in multi-site settings. The root causes of these barriers fall into two major categories: 1) heterogeneity of ETL (extract, transform, load) processes between differing sites with their own disparate EHR environments, and 2) human factor variation in gold standard corpus development processes.

ETL Process Heterogeneity. The challenges faced by NLP development and evaluation to
facilitate the secondary use of EHR data originate from the complex, voluminous, and dynamic nature of the data being documented and stored within a heterogeneous set of disparate, institution-specific EHR implementations. Variations in EHR system vendors, data infrastructure (e.g., unified, ontology-driven, and de-centralized), and institutions' modes of operation can lead to idiosyncratic ways in which clinical data are documented, transformed, and represented 25. Collecting these data requires a significant expenditure of effort to locate, retrieve, and link EHR data into a specific format 26. This variability in the ETL processes required to support such a high level of data heterogeneity brings additional challenges to the adoption of NLP for clinical and translational research, substantially limiting both the cross-institutional interoperability of developed NLP solutions and the reproducibility of the associated evaluations.
Human factor variation in gold standard corpus development process. The process of developing, evaluating, and deploying NLP solutions in both mono- and multi-site environments can be task-specific, iterative, and complex, often involving a multitude of stakeholders with diverse backgrounds 13,26. A key step prior to model development is corpus annotation, the process of developing a gold standard by marking the occurrence of task-defined sets of clinical information as well as their associated interpretative linguistic features (e.g., certainty, status) within text documents. Due to the complexity of clinical language, creating such gold standard corpora requires significant expenditure of domain expertise and time, as clinical experts regularly make decisions directly affecting study cohort selection, annotation guidelines, and task definitions. Studies have discovered potential biases in clinical decision making and interpretation of clinical guidelines 27, in coding of clinical terminologies 28, and in interpretation of imaging findings 29. This issue can be further exacerbated in multi-site collaborations due to inter-site variations in care practice 30,31, ultimately affecting the validity and reliability of the resulting gold standard corpus. A coordinated, transparent, and collaborative platform is therefore needed to promote open team science collaboration in NLP algorithm development and evaluation through consensus building, process coordination, and best practice sharing. Built upon our previous work 32,33, here we propose an open NLP development framework to address the aforementioned issues through the following components: 1) an interoperable NLP infrastructure for incorporation of different NLP engines, utilizing a clinical common data model for data source interfacing and representation with the aim of reducing the impact of ETL process heterogeneity; 2) a transparent multi-site participation workflow for corpus development and evaluation with the aim of addressing the variation in data abstraction and annotation processes between sites; and 3) a user-centric crowdsourcing interface for collaborative ruleset development that enables effective and efficient gathering, synthesizing, and fusing of site-specific knowledge and findings. To demonstrate the viability of the framework, we conducted a case study in which we developed, evaluated, and implemented an NLP algorithm for extracting COVID-19 signs and symptoms 34,35,36 to support the National COVID Cohort Collaborative (N3C).

Framework Description
The framework itself consists of a data ingestion layer, a processing layer, and a data persistence layer. The architecture of the proposed framework is illustrated in Figure 3. The data ingestion layer works as the data collector, reading text from a configurable variety of data sources, such as relational databases or file systems, and loading it into the NOTE table of the OMOP CDM. The processing layer serves as the NLP engine, where information is extracted from raw text given a set of heuristic rules created for various NLP engines. By default, as an example implementation, the MedTagger 37 NLP engine is provided, although alternative NLP engines can be substituted by wrapping their respective NLP pipelines to conform to a provided API specification. After term modifiers are attached to the extracted condition mentions by contextual rules from the ConText algorithm 38, these conditions are composed into clinical events with temporal information. We opt for a symbolic solution because of its simplicity, transparency, and interpretability: the outcomes are fully deterministic given the definition of the rules. When the baseline rulesets and dictionaries are made available to the public, they can therefore be easily refined by different users at different sites. The data persistence layer stores the resulting extracted NLP artifacts in the OMOP CDM NOTE_NLP table as events are emitted by the NLP systems.
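To make this flow concrete, the following minimal Python sketch mirrors the processing layer's behavior under stated assumptions: the two regular-expression rules, the 30-character negation window, and all names are hypothetical illustrations rather than the distributed MedTagger ruleset, and the real ConText algorithm handles many more modifier classes (e.g., hypothetical, historical, experiencer).

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical symptom rules in the spirit of a symbolic ruleset:
# each maps a regular expression to a normalized concept label.
RULES = [
    (re.compile(r"\b(fevers?|febrile)\b", re.I), "FEVER"),
    (re.compile(r"\bcough(ing)?\b", re.I), "COUGH"),
]

# Simplified ConText-style cue: negation within a short left window.
NEGATION = re.compile(r"\b(no|denies|without)\b", re.I)

@dataclass
class NoteNlpRow:
    note_id: int
    lexical_variant: str   # matched text span
    concept: str           # normalized concept label
    term_modifiers: str    # e.g., "certainty=negated"

def process_note(note_id: int, note_text: str) -> List[NoteNlpRow]:
    """Processing layer: apply rules, then attach contextual modifiers."""
    rows = []
    for pattern, concept in RULES:
        for match in pattern.finditer(note_text):
            left_window = note_text[max(0, match.start() - 30):match.start()]
            certainty = "negated" if NEGATION.search(left_window) else "positive"
            rows.append(NoteNlpRow(note_id, match.group(), concept,
                                   f"certainty={certainty}"))
    return rows

# The data ingestion layer would populate NOTE; the persistence layer
# would write these rows into NOTE_NLP. Here we just print them.
for row in process_note(1, "Patient denies fever but reports a dry cough."):
    print(row)
```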

N3C Case Study
NLP Algorithm Development and Evaluation: Table 1 shows the annotation corpora statistics.
A COVID-19 sign/symptom ruleset consisting of 17 concepts was produced. The inter-annotator agreement (IAA) on the annotated corpus was an F1-score of 0.686 for Mayo, 0.516 for UMN, and 0.211 for UKen. Two NLP algorithms were evaluated in this study. One was developed based solely on the narratives sourced from a single site (Mayo Clinic). The other took the resulting single-site NLP algorithm and fine-tuned it based on the annotated training data from two additional sites (UMN and UKen).
Table 2 shows the performance of the single-site NLP algorithm and Table 3 shows the performance of the multi-site NLP algorithm. The single-site ruleset yielded F-scores of 0.876, 0.706, and 0.694 on the Mayo, UMN, and UKen test datasets, respectively, while the multi-site ruleset improved performance to 0.884, 0.769, and 0.806. The multi-site NLP algorithm thus outperformed the single-site algorithm, but both degraded when moving from the Mayo site to the other sites.
Tables 4, 5, and 6 show the results of error analysis for the three sites. For false positives (FP), the major discrepancies between the NLP algorithm and the gold standard were due to the algorithm extracting mentions that were not COVID-19 signs/symptoms but instead appeared in instructions/patient education, as adverse events/indications of treatment, in clinical goals/precautions, in templates, etc. It should be noted that the gold standards were not always correct, and in some notes it was hard to judge whether mentions were COVID-19 signs/symptoms when the symptoms did not appear together with COVID-19 or when de-identified dates were inconsistent. For false negatives (FN), reasons included incomplete rule coverage in the NLP algorithm, tokenization errors caused by the de-identification process, templates, and annotation errors.

Discussion
In this study, we proposed an open NLP development framework with the following properties: an interoperable NLP infrastructure, a transparent multi-site participation workflow, and a user-centric crowdsourcing interface. The key goal of this framework is to facilitate multi-site collaborative development, evaluation, and implementation of NLP algorithms. The framework has been implemented to support efforts conducted by the National COVID Cohort Collaborative (N3C) to enable the utilization of unstructured text at high throughput.

Here, we have presented our results from running our framework using a centralized annotation process on texts sourced from multiple sites after de-identification, with the aim of assessing the impact on NLP algorithm development (single-site algorithm vs. multi-site algorithm). Several pragmatic implementation challenges were discovered that may impact the intermediate and final NLP results. We observed that IAA varied greatly between the three sites (0.686 F1-score for Mayo, 0.516 for UMN, 0.211 for UKen) despite the fact that annotators had been trained using de-identified Mayo notes. First, under a centralized annotation approach, the process of text data collection took a very long time because each site needed to complete de-identification before sharing data. Second, it was a challenge for annotators to work on annotation tasks that spanned a long period of time. Third, the shared data sets were usually small; as such, annotators had no opportunity to conduct annotation training using these outside notes, and it was hard for them to become familiar with the disparate variety of document structures from other sites.
Both the multi-site and single-site NLP algorithms showed degrading performance from the Mayo site to the other sites, although this issue was less prominent for the multi-site algorithm. The data sharing issues also impacted NLP algorithm performance. First, training sets from outside institutions were very small due to the small number of shared notes, causing difficulties in developing comprehensive rules: the features, patterns, and contextual information that could appear in third-party narratives could not be fully represented in such a small sample. Second, de-identification processes could introduce text span issues that affect the input text format and thus NLP algorithm performance. Algorithms developed in a centralized mode were therefore not ready for immediate use at multiple sites, as additional local fine-tuning is still needed before final implementation and application.
Our experimental results showed that a centralized approach to multi-site NLP algorithm development is suboptimal for advancing the adoption of NLP techniques in the clinical and translational research community, further supporting our proposed federated method. The experiment also demonstrated that deployment of NLP algorithms for multi-site studies needs to be done at each local site. To ensure the scientific rigor of the data generated, each site needs to perform annotation and evaluation on its own while collectively contributing to NLP algorithm development and refinement. Since the NLP models to be shared are rule-based, they can be shared without the Protected Health Information (PHI) concerns typically associated with sharing language resources.
In the proposed workflow, each site will evaluate the NLP algorithms for concept extraction by creating a gold standard corpus based on the common annotation guidelines. The federated evaluation can be deployed leveraging cloud computing through a centralized controller from which NLP algorithms can be distributed to each institution. NLP Sandbox 1 is an example of such an evaluation framework, which uses Docker 39 containers to encapsulate algorithm implementations.
By adopting this process, the evaluation happens only behind each institution's firewall, and only the summary statistics on NLP algorithm performance (i.e., no raw data containing PHI) are transferred out of the firewall. Performance statistics, such as the precision, recall, and F1-score, as defined depending on the experimental setting, can be obtained in near real time and can thus be used as part of continuous development workflows.
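A minimal sketch of this reporting pattern is shown below, assuming exact span matching and a simple JSON payload; the actual NLP Sandbox protocol and N3C evaluation harness are more involved, and only the aggregate counts and scores in the payload would ever leave the firewall.

```python
import json

def federated_summary(gold: set, predicted: set) -> str:
    """Compare gold-standard and NLP-extracted spans locally and
    return only aggregate statistics suitable for sharing."""
    tp = len(gold & predicted)   # true positives
    fp = len(predicted - gold)   # false positives
    fn = len(gold - predicted)   # false negatives
    # precision = tp / (tp + fp); recall = tp / (tp + fn);
    # F1 = 2 * precision * recall / (precision + recall)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Only counts and scores cross the firewall; no text or PHI.
    return json.dumps({"tp": tp, "fp": fp, "fn": fn,
                       "precision": precision, "recall": recall, "f1": f1})

# Spans are (note_id, start, end, concept) tuples; exact match assumed.
gold = {(1, 15, 20, "FEVER"), (1, 40, 45, "COUGH")}
pred = {(1, 15, 20, "FEVER"), (1, 60, 66, "NAUSEA")}
print(federated_summary(gold, pred))
```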
This federated process offers several benefits. For instance, when conducting error analysis, we discovered that context played an important role in this case study. Error analyses showed that extracting COVID-19 signs/symptoms was not a trivial task, as their occurrence is not necessarily due to COVID-19: they could appear as adverse events/indications of treatment, in instructions/patient education, in clinical goals/precautions, etc. This posed a challenge not only for annotation but also for NLP algorithm development. One benefit of the federated annotation and development process is that these contexts can be systematically incorporated by local expertise during annotation.
Deployment of a federated development framework requires the participation of multiple sites.
Adoption can, however, be hindered by the fact that the process of translating NLP algorithms into implementation is complex, much like the "bench to bedside" process that translates laboratory discoveries into patient care. To facilitate participation in our federated method, we have developed a further suite of tools such as MedTator 40 and best practice guidelines 41. MedTator, a serverless annotation tool, aims to provide an intuitive and interactive user interface for high-quality annotation corpus generation. The best practice guidelines contain detailed instructions for facilitating multi-site annotation practice through the following key activities: task formulation, cohort screening, annotation guideline development, annotation training, annotation production, and adjudication.
Simply having these toolsets available is, however, insufficient. Pragmatically, we have seen a hyper-focus on novel methods in academia, with competing rather than collaborative priorities in NLP algorithm development. Our experience suggests that a collaborative development process for NLP algorithms is needed for truly implementable and useful multi-site NLP solutions. This is one of the key goals we seek to achieve with the Open Health Natural Language Processing (OHNLP) Collaboratory, and we have thus positioned our framework's workflow to facilitate this task. Additionally, we recognize that this is not simply a software problem; a local workforce is also needed at each institution. By conducting coordinated development of NLP algorithms deployed using our framework as a solution for consortium-specific tasks such as those of the N3C, we simultaneously build the local human workforce at institutions necessary to conduct federated development, evaluation, and implementation of NLP algorithms using our framework.

Design Principles
Incorporating standards and interoperability. A common barrier to the widespread adoption of NLP in clinical research is the need to transform inputs and outputs to conform to parts of an overall pipeline. While seemingly straightforward, such a task is difficult without prior significant investment in associated infrastructure and dedicated software development. It is therefore desirable to leverage existing infrastructure where possible and incorporate such effort into the distributed NLP pipeline to reduce the technical burden on the end user.
There is, however, significant variation in terms of available infrastructure and data availability amongst different institutions. Creating a solution that is immediately suitable for all these environments out of the box would be immensely challenging. For that reason, we sought to leverage existing data modeling efforts that are likely to be already adopted by academic medical institutions to standardize the data ingestion and output process. In our implementation, we chose the Observational Health Data Sciences and Informatics' Observational Medical Outcomes Partnership common data model (OHDSI/OMOP CDM) to handle input of clinical narratives via the NOTE table and output via the NOTE_NLP table. This brings the advantage that input/output is now standardized: so long as institutions have already transformed their clinical data into the OMOP CDM, and/or their downstream NLP-reliant applications read from the OMOP CDM database, no additional technical development burden is needed.
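The sketch below illustrates this standardized I/O contract in Python against an in-memory SQLite database, using an abbreviated subset of the OMOP CDM NOTE and NOTE_NLP columns; the extract function is a stand-in for the NLP engine, and the production framework instead targets institutional OMOP instances through its configurable connectors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Abbreviated NOTE and NOTE_NLP tables (subset of the OMOP CDM columns).
conn.execute("CREATE TABLE note (note_id INTEGER, person_id INTEGER,"
             " note_text TEXT)")
conn.execute("CREATE TABLE note_nlp (note_nlp_id INTEGER, note_id INTEGER,"
             " lexical_variant TEXT, nlp_system TEXT, term_modifiers TEXT)")
conn.execute("INSERT INTO note VALUES (1, 100, 'Reports fever and chills.')")

def extract(text):
    """Stand-in for the NLP engine; yields (span, modifiers) pairs."""
    for term in ("fever", "chills"):
        if term in text.lower():
            yield term, "certainty=positive"

# Read clinical narratives from NOTE, write NLP artifacts to NOTE_NLP.
row_id = 0
for note_id, _person_id, text in conn.execute("SELECT * FROM note"):
    for span, modifiers in extract(text):
        row_id += 1
        conn.execute("INSERT INTO note_nlp VALUES (?, ?, ?, ?, ?)",
                     (row_id, note_id, span, "ExampleEngine", modifiers))

print(conn.execute("SELECT * FROM note_nlp").fetchall())
```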
It is important to note that standardization as a default only serves to simplify adoption for those who already have a solution complying with the standard and cannot be a comprehensive solution. A purely OMOP CDM-reliant solution is not ideal, as not all institutions will have their own OMOP CDM instance, and standing up such an instance just to use a pipeline may impose undue burden. For that reason, input/output in our infrastructure is modularized and can be substituted at will: the default OMOP CDM I/O utilizes a variant of SQL-based data extractors/writers, and the specific query and connection strings used can be substituted via plaintext configuration changes. Additionally, SQL-based I/O is not the only supported setting; a variety of other data sources, including Elasticsearch, Google Cloud Storage, Amazon S3, and plaintext, are included as configuration-swappable options.

Crowdsourcing algorithm development. To promote collaboration and sharing of efforts between participants in the algorithm development process, we built a crowdsourcing platform for domain experts to upload, customize, and examine their NLP algorithms in an interactive web application.
Users can create keyword-based and rule-based algorithms and test their performance instantly in the online environment. The crowdsourcing platform consists of three modules built on top of our NLP system to support expert collaboration: a dictionary builder, a regular expression ruleset editor, and a detection result visualization.
The dictionary builder can extend the keyword collection used by the algorithm; users can pull particular terms from ontology databases such as CIDO 42 and MONDO 43. The regular expression ruleset editor provides an integrated interface to help users customize their own regular expression rulesets (on top of an existing dictionary, if desired) to support use cases such as extraction of new symptoms, treatments, or outcomes. The detection result visualization, designed based on the Brat annotation tool 44, allows users to check the results generated by different methods.
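As a hedged illustration of what these modules produce, the Python sketch below represents a user-extended dictionary and one custom regular-expression rule as plain data, together with an "Upload and test" style check; the terms, the rule, and the representation are hypothetical and do not reproduce the platform's actual ruleset format.

```python
import re

# Dictionary builder output: terms a user selected from ontologies
# such as MONDO, mapped to normalized concept labels (hypothetical).
dictionary = {
    "shortness of breath": "DYSPNEA",
    "loss of smell": "ANOSMIA",
}

# Rule editor output: a custom regex rule layered on the dictionary,
# here capturing lexical variants not listed verbatim.
custom_rules = [
    (re.compile(r"\bcan(?:not|'t) smell\b", re.I), "ANOSMIA"),
]

def test_ruleset(sentence: str):
    """Mimics the 'Upload and test' check: report every hit."""
    hits = []
    for term, concept in dictionary.items():
        if term in sentence.lower():
            hits.append((term, concept))
    for pattern, concept in custom_rules:
        for match in pattern.finditer(sentence):
            hits.append((match.group(), concept))
    return hits

print(test_ruleset("Patient states she cannot smell and has shortness of breath."))
```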

Case study
The National COVID Cohort Collaborative 36 (N3C) is a novel partnership that includes the Clinical and Translational Science Awards (CTSA) Program hubs, the National Center for Advancing Translational Sciences (NCATS), the Center for Data to Health (CD2H), and the community, focusing on collaborative sharing of structured EHR data. Access to unstructured data is limited due to the protection of PHI and clinical care decision logic, which further contributed to the lack of NLP infrastructure within the consortium. However, structured data does not show the whole picture from the EHR perspective, greatly restricting research activities. In this case study, extraction of COVID-19 signs and symptoms was used to investigate the viability of the proposed framework among sites participating in the N3C.

Centralizing gold standard corpus development. Due to resource and time constraints at each of the N3C sites, we opted to conduct the gold standard corpus development process in a centralized manner. A collection of de-identified and synthesized clinical documents was gathered from participating sites through an existing de-identification effort led by the NCATS Center for Data to Health (CD2H). The N3C de-identification and synthetic text generation workflow is illustrated in Figure 1. Specifically, clinical notes from patients with positive COVID-19 test results from three institutions, Mayo Clinic, the University of Kentucky (UKen), and the University of Minnesota at Twin Cities (UMN), were initially collected. Notes that were not office visit notes (e.g., nurse calls, etc.), notes that had fewer than 1000 characters, and notes that were authored more than 14 days prior to the date of the patient's earliest positive COVID-19 test result were further filtered out.
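These inclusion criteria can be restated compactly; the Python sketch below does so under assumed field names (the record layout and the exact date handling are illustrative, not the sites' actual ETL code).

```python
from datetime import date, timedelta

def eligible(note: dict, earliest_positive_test: date) -> bool:
    """Apply the case study's note inclusion criteria to one record."""
    if note["note_type"] != "office visit":   # drop nurse calls, etc.
        return False
    if len(note["text"]) < 1000:              # drop very short notes
        return False
    # Drop notes authored more than 14 days before the earliest
    # positive COVID-19 test result.
    if note["date"] < earliest_positive_test - timedelta(days=14):
        return False
    return True

note = {"note_type": "office visit", "text": "x" * 1200,
        "date": date(2020, 6, 1)}
print(eligible(note, earliest_positive_test=date(2020, 6, 5)))  # True
```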
A total of 369 clinical notes from these sites that met these criteria were randomly selected and de-identified using the de-identification program developed by the Medical

The framework is distributed as open-source software under the Apache 2.0 license via GitHub in three parts: 1) the ETL Backbone (https://github.com/OHNLP/Backbone) with an example NLP engine (https://github.com/OHNLP/MedTagger), 2) process documentation (https://github.com/OHNLP/N3C-NLP-Documentation/wiki), and 3) an open-source collaborative platform for developing NLP rulesets (https://github.com/OHNLP/OHNLPTK). The demo homepage (Figure 2(a); https://ohnlp4covid-dev.n3c.ncats.io/) demonstrates the N3C NLP engine outputs when annotating clinical text using the baseline rulesets and dictionary. The annotations come from three components: the sign/symptom extractor, the temporal information extractor, and the dictionary lookup extractor. To further customize each model, users can visit the "Rule Editor" (https://ohnlp4covid-dev.n3c.ncats.io/ie_editor) and the "Dictionary Builder" (https://ohnlp4covid-dev.n3c.ncats.io/dict_builder) pages (Figure 2(b)). Figure 2(c) provides an example of the rule editing interface with the baseline COVID-19 ruleset. The rulesets can be tested in real time by clicking the "Upload and test" button, whereupon the rulesets are uploaded and an NLP engine is generated for testing and debugging purposes. As a use case study, we also provide an example NLP project for extracting signs/symptoms related to COVID-19 developed for this framework. Elements containing original text, such as text snippets and concept mentions, are truncated before submission.