Integrative data semantics through a model-enabled data stewardship

Abstract

Motivation: The importance of clinical data in understanding the pathophysiology of complex disorders has prompted the launch of multiple initiatives designed to generate patient-level data from various modalities. While these studies can reveal important findings relevant to the disease, each study captures different yet complementary aspects and modalities which, when combined, generate a more comprehensive picture of disease etiology. However, achieving this requires a global integration of data across studies, which proves to be challenging given the lack of interoperability of cohort datasets.

Results: Here, we present the Data Steward Tool (DST), an application that allows for semi-automatic semantic integration of clinical data into ontologies and global data models and data standards. We demonstrate the applicability of the tool in the field of dementia research by establishing a Clinical Data Model (CDM) in this domain. The CDM currently consists of 277 common variables covering demographics (e.g. age and gender), diagnostics, neuropsychological tests and biomarker measurements. The DST combined with this disease-specific data model shows how interoperability between multiple, heterogeneous dementia datasets can be achieved.

Availability and implementation: The DST source code and Docker images are respectively available at https://github.com/SCAI-BIO/data-steward and https://hub.docker.com/r/phwegner/data-steward. Furthermore, the DST is hosted at https://data-steward.bio.scai.fraunhofer.de/data-steward.

Supplementary information: Supplementary data are available at Bioinformatics online.


Supplementary Text 1. Reasons outlining why OMOP was not suitable for dementia datasets
The OMOP Data Model, developed and maintained by OHDSI, is highly popular in bioinformatics since it provides a good standardization capability, while OHDSI also offers analysis tools that work with data in OMOP.
Since the beginning of the project, we have worked closely with clinicians from the University Clinic of Bonn and the DZNE. During that time, we gained insight into multiple studies dealing with dementia and ataxia data. These data sources shared no common structure, and we quickly realized that harmonizing as much study data as possible would require a very flexible data model.

Supplementary Text 2. Implementation of the Data Steward Tool
Django is a high-level web framework written in Python that offers plugins for MongoDB as well as for providing RESTful APIs. The application holds the structure of the Clinical Data Model in terms of Python classes and is capable of communicating that structure to the underlying database. Uploaded data, mappings and the complete data logic are handled here. Moreover, the application provides the APIs that can be queried by other systems. The Vue.js web application uses that API to provide a visual interface. Vue.js is a progressive JavaScript framework for building single-page applications like the Data Steward Tool. The communication between the Vue.js app and the Django backend follows the RESTful API principle and uses JSON as the notation for data exchange. The underlying MongoDB database stores the data model with all its variables and mappings as well as the normalized clinical data uploaded by the user. The services are deployed in a microservice architecture with Docker.

[...] in order to be aligned with variables and terms used outside of the German research landscape. Other studies and (meta-)data resources that were used either during the development of the base variable set or in the later work to establish mappings were AddNeuroMed (the European collaboration for the discovery of novel biomarkers for Alzheimer's disease), terms from the Diagnostic and treatment center for memory disorders (DBGA) of UKB, as well as multiple ataxia-related studies made available to us by the DZNE: ESMI (European Spinocerebellar Ataxia Type 3/Machado-Joseph Disease Initiative) and the SCA Registry (Registry for Spinocerebellar Ataxias (SCA)).

Supplementary Text 4. Definition: Variable Mappings
Variable mappings are a crucial part of the data model. A mapping is most commonly defined as a reference from one variable VAR1 to another variable VAR2, where VAR1 and VAR2 are semantically equivalent. A single mapping in the CDM holds the external variable VAR1, the internal variable VAR2 and the source of VAR1. All variable mappings can be viewed in tabular form: https://data-steward.bio.scai.fraunhofer.de/data-steward/table. As the number of mappings grows, the system can read and understand data from more and more data sources. In the future, we expect the number of mappings (currently 276) to outgrow the number of internal variables (277) by far.
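The definition above can be illustrated with a minimal Python sketch. It is not the DST's actual implementation: the class `VariableMapping`, the function `resolve` and the example variables `AGE` and `SEX` are hypothetical names chosen for illustration, and we assume that a mapping is looked up by external name and source.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class VariableMapping:
    """Hypothetical record for one mapping (not the DST's real class)."""
    external_variable: str  # VAR1, the name used in the source dataset
    internal_variable: str  # VAR2, the semantically equivalent CDM variable
    source: str             # origin of VAR1, e.g. a study name


def resolve(name: str, source: str,
            mappings: Iterable[VariableMapping],
            internal_variables: Set[str]) -> Optional[str]:
    """Return the CDM variable for an incoming name, or None if unknown."""
    if name in internal_variables:
        return name  # already a CDM variable, no mapping needed
    for mapping in mappings:
        if mapping.external_variable == name and mapping.source == source:
            return mapping.internal_variable
    return None  # the variable would have to be mapped first


# Illustrative data only; these are not taken from the real CDM.
internal = {"AGE", "SEX"}
mappings = [VariableMapping("gender", "SEX", "ADNI")]
print(resolve("gender", "ADNI", mappings, internal))
```

Keeping the source as part of each mapping allows the same external name to mean different things in different studies.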

Supplementary Text 5. Fuzzy Matching
Fuzzy string matching in the context of the DST means that we assign one variable to another based on their string similarity (https://en.wikipedia.org/wiki/Edit_distance). The success of fuzzy string matching depends heavily on variable naming: ABETA 42 → abeta_42 is easy to map based on string similarity, whereas pairs like gender → sex will almost certainly never be mapped by this approach. Future work includes improving the mapping assistant with AI-based entity matching.
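The behaviour described above can be sketched with a plain Levenshtein edit distance. This is only an illustration under assumptions: the text does not specify which similarity measure or cut-off the DST uses, so the normalization by the longer string, the case-insensitive comparison and the threshold of 0.8 are our own choices, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def suggest(external: str, internal_variables, threshold: float = 0.8):
    """Return the most similar internal variable, or None below the threshold."""
    best = max(internal_variables,
               key=lambda v: similarity(external, v), default=None)
    if best is not None and similarity(external, best) >= threshold:
        return best
    return None


candidates = ["abeta_42", "sex"]
print(suggest("ABETA 42", candidates))  # one substitution away from abeta_42
print(suggest("gender", candidates))    # no string overlap with "sex"
```

With these assumptions, "ABETA 42" and "abeta_42" differ by a single substitution, while "gender" and "sex" share almost no characters, which is why a purely string-based approach cannot recover that mapping.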

Supplementary Text 6. Tutorial of the upload process
Clinical data is typically stored in 2D data tables, where each row represents one measurement for one patient or subject, given by the variable name and the measured value. The tool reads data in a CSV-like format in which each line represents one measurement. For example, one line consists of the patient's ID (Entity), the variable AGE (Attribute) and the patient's age (Value). This representation of clinical data is called the EAV (Entity-Attribute-Value) format. Software that transforms clinical data with multiple values per row into this format is available. The Data Steward Tool can read and process EAV files in a drag-and-drop section (Figure 3).
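The conversion from a table with multiple values per row into EAV triples can be sketched as follows. This is not the transformation software mentioned above; the function name `wide_to_eav`, the column name `PID` and the sample values are assumptions made for illustration, and the exact column layout expected by the DST is not specified in this text.

```python
import csv
import io
from typing import List, Tuple


def wide_to_eav(wide_csv: str, id_column: str) -> List[Tuple[str, str, str]]:
    """Turn one-row-per-patient CSV text into (entity, attribute, value) triples."""
    triples = []
    for record in csv.DictReader(io.StringIO(wide_csv)):
        entity = record[id_column]
        for attribute, value in record.items():
            if attribute == id_column or value in (None, ""):
                continue  # skip the ID column itself and missing measurements
            triples.append((entity, attribute, value))
    return triples


# Illustrative input: two patients, one of them with a missing AGE value.
wide = "PID,AGE,SEX\nP001,71,F\nP002,,M\n"
for entity, attribute, value in wide_to_eav(wide, id_column="PID"):
    print(entity, attribute, value, sep=",")
```

Each output line carries exactly one measurement, so missing values simply produce no line instead of an empty cell.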

Supplementary Figure 3. Data upload screenshot
During the reading process, the user receives live feedback on how many lines were found in the file, how many distinct patients they belong to, and which variables could not be found in the model. After the process has finished, the tool provides a summary of the results for the user to analyze. If every variable was found in the underlying data model, nothing else needs to be done and the data is successfully semantically integrated. Otherwise, the user can map the remaining variables onto the CDM via a guided process. From the upload feedback, the user can go to the mapping assistant (Figure 4). [...] activate the OLS (Ontology Lookup Service) autocomplete to conveniently integrate terms from major ontologies if no suitable variable is present in the current model. Once a suitable variable has been found, the user has to select a source for the external variable (e.g. ADNI); the mapping can then be submitted, and the corresponding line is removed from the table. If the mapping assistant suggests a wrong mapping, the user can edit the respective line manually via a dialog window supported by autocompletion (Figure 5). Thus, the user has a comfortable way to map all of their data onto the CDM.
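The upload feedback described above (number of lines, number of distinct patients, variables missing from the model) can be approximated by a short sketch. The function `summarize_upload` and the keys of the returned dictionary are hypothetical; the DST's own summary may be structured differently.

```python
from typing import Dict, Iterable, Set, Tuple


def summarize_upload(eav_rows: Iterable[Tuple[str, str, str]],
                     known_variables: Set[str]) -> Dict[str, object]:
    """Count lines and distinct patients, and collect variables not in the model."""
    patients = set()
    unknown = set()
    n_lines = 0
    for entity, attribute, _value in eav_rows:
        n_lines += 1
        patients.add(entity)
        if attribute not in known_variables:
            unknown.add(attribute)  # candidate for the mapping assistant
    return {
        "lines": n_lines,
        "patients": len(patients),
        "unmapped_variables": sorted(unknown),
    }


# Illustrative rows; "gender" is assumed not to be part of the model here.
rows = [("P001", "AGE", "71"), ("P001", "gender", "F"), ("P002", "AGE", "65")]
print(summarize_upload(rows, known_variables={"AGE"}))
```

An empty `unmapped_variables` list corresponds to the case in which every variable was found and no mapping step is needed.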

Supplementary Text 7. Aligning CDM with FHIR and OMOP
In bioinformatics, exchanging health data between healthcare ecosystems or institutions is a crucial part of research. The FHIR (Fast Healthcare Interoperability Resources) standard of HL7 is a common solution for this. Hence, the Data Steward Tool can work as a FHIR server by providing multiple APIs (see the API documentation) that return either all patient data in FHIR format (observation resource defined by the FHIR standard) or single measurements for one patient. Moreover, users can query the DST for certain patients in certain studies (patient resource).

As mentioned above, the OMOP data model is an established solution for data standardization in bioinformatics, and hence it is essential that the CDM is aligned with OMOP. We therefore searched for mappings between every variable of the CDM and the OMOP standard vocabulary using OHDSI's Athena browser. This embedding of the CDM onto OMOP is updated whenever the CDM changes, or after a certain time, to keep the mappings up to date. So far, we have been able to map 181 variables of the CDM to the OMOP standard vocabulary.
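To give an idea of what a single measurement looks like as a FHIR observation resource, the following sketch builds a minimal Observation as a Python dictionary. It uses only standard FHIR Observation elements (`resourceType`, `status`, `code`, `subject`, `valueQuantity`); the helper name `to_fhir_observation`, the example values and the choice to carry the variable name in `code.text` are assumptions, and the actual payload returned by the DST APIs may differ.

```python
import json


def to_fhir_observation(patient_id: str, variable: str,
                        value: float, unit: str) -> dict:
    """Build a minimal FHIR Observation for one EAV measurement (illustrative)."""
    return {
        "resourceType": "Observation",
        "status": "final",
        # Assumption: the CDM variable name is carried as free text; a real
        # server would typically add a coding from a terminology as well.
        "code": {"text": variable},
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": value, "unit": unit},
    }


observation = to_fhir_observation("P001", "AGE", 71, "years")
print(json.dumps(observation, indent=2))
```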

Supplementary Text 8. Importing other data models by the example of i2b2
As mentioned in the paper, there are several major data models that aim to capture general biomedical data, such as OMOP, or domain-specific data, such as GA4GH. The i2b2 data model is capable of describing general clinical data. To contribute to a research landscape in which data interoperability is crucial, it is important that the Data Steward Tool is able to work with other data models in its backend. We have created a Python package, available on GitHub, that contains an example of how to import all i2b2 variables into the DST.