ROMOP: a light-weight R package for interfacing with OMOP-formatted electronic health record data

Abstract
Objectives: Electronic health record (EHR) data are increasingly used for biomedical discoveries. The nature of the data, however, requires expertise in both data science and EHR structure. The Observational Medical Outcomes Partnership (OMOP) common data model (CDM) standardizes the language and structure of EHR data to promote interoperability of EHR data for research. While the OMOP CDM is valuable and more attuned to research purposes, it still requires extensive domain knowledge to utilize effectively, potentially limiting more widespread adoption of EHR data for research and quality improvement.
Materials and methods: We have created ROMOP: an R package for direct interfacing with EHR data in the OMOP CDM format.
Results: ROMOP streamlines typical EHR-related data processes. Its functions include exploration of data types, extraction and summarization of patient clinical and demographic data, and patient searches using any CDM vocabulary concept.
Conclusion: ROMOP is freely available under the Massachusetts Institute of Technology (MIT) license and can be obtained from GitHub (http://github.com/BenGlicksberg/ROMOP). We detail instructions for setup and use in the Supplementary Materials. Additionally, we provide a public sandbox server containing synthesized clinical data for users to explore OMOP data and ROMOP (http://romop.ucsf.edu).


ROMOP
ROMOP is a flexible R package to interface with the Observational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model. Briefly, OMOP is a standardized relational database schema for Electronic Health Record (EHR) or Electronic Medical Record (EMR) data (i.e., patient data collected during clinical visits to a health system). The main benefit of a standardized schema is that it allows for interoperability between institutions, even if the underlying EHR vendors are disparate.
For a detailed description of the OMOP common data model, please visit this helpful wiki.
In its backend, OMOP relies on standardized data ontologies and metathesauri, such as the Unified Medical Language System (UMLS), and as such, the queries within ROMOP rely heavily on these vocabularies. Athena is a great tool to better understand the concepts in these ontologies and identify ideal search terms of interest.

Sandbox Server
The Centers for Medicare and Medicaid Services (CMS) have released a synthetic clinical dataset (DE-SynPUF) in the public domain with the aim of being reflective of the patient population but containing no protected health information. The OHDSI group has undertaken the task of converting these data into the OMOP CDM format. Users can set up this configuration on their own systems by following the instructions on the GitHub page. We obtained all data files from the OHDSI FTP server (accessed June 17th, 2018) and created the CDM (DDL and indexes) according to their official instructions, but modified for MySQL. For space considerations, we only uploaded one million rows of each of the data files. The sandbox server is an R Shiny server running as an Elastic Compute Cloud (EC2) instance on Amazon Web Services (AWS) querying a MySQL database server (AWS Aurora MySQL).

Clinical Data
ROMOP requires EHR data to be in OMOP format and on a server accessible to the user. In its current form, ROMOP can connect to databases in MySQL using the RMySQL driver, or to many other formats, including Oracle, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Google BigQuery, and Microsoft Parallel Data Warehouse, through the DatabaseConnector and SqlRender packages developed by the OHDSI group (see below).
Users without access to EHR data might consider using synthetic public data following the instructions provided by the OHDSI group here.

Programming Language
ROMOP is built in the R environment and developed on version 3.4.4 (2018-03-15).
ROMOP requires the following R packages: • DBI (developed on version 1.0.0) • data.

Installation
Download
ROMOP can be installed easily from GitHub using the devtools package. Alternatively, the package can be downloaded directly from the GitHub page and installed by the following steps:
1. Unzip ROMOP-master.zip
2. R CMD INSTALL ROMOP-master
Please see the Setup section to properly configure the package.

Credentials
In accordance with best practices for storing sensitive information, credentials are not written into scripts but stored in the .Renviron file. A formatted .Renviron file is provided with the package with the following fields to fill in:
driver = ""
host = ""
username = ""
password = ""
dbname = ""
port = "3306"
• driver (case insensitive): "mysql" for MySQL or (according to the OHDSI DatabaseConnector package) "postgresql" for PostgreSQL, "oracle" for Oracle, "sql server" for Microsoft SQL Server, "redshift" for Amazon Redshift, "pdw" for Microsoft Parallel Data Warehouse, or "bigquery" for Google BigQuery.
• host (or server depending on database format)
• dbname: OMOP EHR database name (or schema depending on database format)
Note that this .Renviron file has to be in the directory from which R is launched. If you already use an .Renviron file, add this information to it.
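As a rough illustration of how credentials stored this way can be read back in R, the sketch below looks up the six fields with Sys.getenv and reports any that are unset or empty. The helper missing_credentials is hypothetical and is not a ROMOP function; the Sys.setenv call only simulates a partially filled-in .Renviron for the current session.

```r
# Field names taken from the .Renviron template described above.
credential_fields <- c("driver", "host", "username", "password", "dbname", "port")

# Return the names of any fields that are unset or empty in the environment.
# (Hypothetical helper for illustration; not a ROMOP function.)
missing_credentials <- function(fields = credential_fields) {
  values <- Sys.getenv(fields, unset = "")
  fields[values == ""]
}

# Simulate a partially filled-in .Renviron for this session only.
Sys.setenv(driver = "mysql", host = "localhost", username = "",
           password = "", dbname = "omop", port = "3306")

missing <- missing_credentials()
if (length(missing) > 0) {
  message("Missing credentials: ", paste(missing, collapse = ", "))
}
```

Because R reads .Renviron only at startup, edits to the file take effect after restarting the R session.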

Checks
With credentials correctly configured, the package can be loaded. ROMOP will then check that 3 conditions are met:
1. Check that the credentials exist and can be retrieved from the .Renviron file: requires that driver, host, username, password, dbname, and port exist
2. Check that a connection to the OMOP EHR server and database can be made: uses the above credentials
3. Check that all required OMOP tables exist and contain (any) data: the required tables are: "concept","concept_ancestor","concept_relationship","condition_occurrence","death", "device_exposure","drug_exposure","measurement","observation","person","procedure_occurrence","visit_occ
• if any of the above tables are missing, a warning message will be produced and the package will not be able to load properly.
• if any of the above tables exist but do not contain any data, a warning message will be produced but the package will still be able to function.
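The third check can be pictured with the following sketch, which works on a named vector of row counts rather than a live database. It is not ROMOP's implementation: check_tables is a hypothetical helper, the row counts are made up, and the table list includes only the names that appear in full above.

```r
# Tables named in full in the list above; the final, truncated entry is
# deliberately left out rather than guessed.
required_tables <- c("concept", "concept_ancestor", "concept_relationship",
                     "condition_occurrence", "death", "device_exposure",
                     "drug_exposure", "measurement", "observation", "person",
                     "procedure_occurrence")

# row_counts: named numeric vector, table name -> number of rows found.
# Hypothetical helper for illustration; not a ROMOP function.
check_tables <- function(row_counts, required = required_tables) {
  missing <- setdiff(required, names(row_counts))
  present <- intersect(required, names(row_counts))
  empty <- present[row_counts[present] == 0]
  if (length(missing) > 0) {
    warning("Missing tables: ", paste(missing, collapse = ", "))
  }
  if (length(empty) > 0) {
    warning("Empty tables: ", paste(empty, collapse = ", "))
  }
  # Missing tables block use; empty tables only warn.
  list(missing = missing, empty = empty, usable = length(missing) == 0)
}

# Toy input: every table present, but device_exposure has no rows.
counts <- setNames(rep(100, length(required_tables)), required_tables)
counts["device_exposure"] <- 0
result <- suppressWarnings(check_tables(counts))
```

In this toy case the result is still usable, mirroring the rule that empty tables produce a warning while missing tables prevent the package from loading properly.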

On start
Successfully passing all checks will allow the user to begin using ROMOP.
1. Set an output directory to use with the changeOutDirectory function (note: the default output directory will be declared on package load).
2. Create/load the Data Ontology (required to decode data types) using the makeDataOntology function. The first time the package is run, the concept ontology has to be built; if the store_ontology option is selected, the ontology will be saved as an .rds file for subsequent loading.

Utility
getDemographics
Description: Retrieves and formats patient demographic data from the person and death tables. Option to restrict to a patient list of interest.
Usage: ptDemo <- getDemographics(patient_list=NULL, declare=TRUE)
Arguments:
patient_list: comma-separated string of patient ids. A provided patient list will restrict the search to those ids; NULL will return demographic data for all available patients.
• patient_list should be in the following format: "patient_id_1, patient_id_2, . . . "
getClinicalData
Description: Retrieves all relevant clinical data for individuals in a patient list. Wrapper for domain-specific getData functions (which can also be used separately).

Usage: ptClinicalData <- getClinicalData(patient_list, declare=TRUE)
Arguments:
patient_list: comma-separated string of patient ids. A provided patient list will restrict the search to those ids; NULL will return demographic data for all available patients.
declare: TRUE/FALSE. If TRUE, outputs status and updates to the screen.
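Since patient_list is passed as a single comma-separated string, it can help to build it from a vector of ids. The sketch below shows one way to do that in base R; both helpers are hypothetical conveniences rather than part of ROMOP, and the ids are made up.

```r
# Collapse a vector of ids into the "id_1, id_2, ..." string format.
# Hypothetical helper for illustration; not part of ROMOP.
make_patient_list <- function(ids) {
  paste(unique(ids), collapse = ", ")
}

# The reverse: recover individual ids (as character) from such a string.
split_patient_list <- function(patient_list) {
  trimws(strsplit(patient_list, ",", fixed = TRUE)[[1]])
}

ids <- c(1001, 1002, 1002, 1005)  # made-up patient ids, one duplicated
patient_list <- make_patient_list(ids)
```

The resulting string could then be passed as the patient_list argument of getDemographics or getClinicalData.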

Arguments:
strategy_in: mapped or direct. Dictates the strategy for how inclusion criteria are treated (see Details).
vocabulary_in: vocabularies for inclusion criteria. Comma-separated string of relevant vocabularies for inclusion criteria (see Details).
codes_in: specific concept codes for inclusion criteria. Semicolon-separated string of code concepts for inclusion criteria, corresponding to the order for vocabulary_in. Multiple codes can be used per vocabulary and should be comma-separated (see Details).
function_in: and or or. Dictates how multiple inclusion criteria should be treated. and necessitates that all inclusion criteria are met (i.e., intersection), while or allows for any criteria to be met (i.e., union) (see Details).
out_name: name assigned to the search query, or NULL. If save == TRUE, saves the query using the provided name. If the provided name already exists as a directory (or is NULL), the directory defaults to a datetime name (see Details).

Value:
Returns a list of patients that meet inclusion criteria (and not exclusion criteria if entered).

Details:
• direct strategy queries the concepts directly by _source_concept in clinical tables. mapped maps to common ontology (via concept_synonym) and identifies relevant descendants (via concept_ancestor) to search for in _concept fields.
• the exploreConcepts function can be used to find ideal concepts to search for.
• function_ corresponds to how criteria should be treated. and necessitates patients meet all criteria while or allows for patients to meet any of the criteria.
• Please note that if no standard common concepts are found per search domain, a warning message will appear and the search will not be able to be performed (see Helpful Hints for more details).
• if save == TRUE, the following information is saved in a directory per query:
- query: all arguments for the search.
- _criteria_mapped: all original criteria for inclusion (and exclusion if applicable) that are mapped to dataOntology.
- criteria_mapped_concepts: all mapped concepts used for inclusion (and exclusion if applicable) that are used to search in clinical data tables. Additionally, the pt_count column displays the number of unique patients that have a record with the corresponding concept.
- outcome: results of the search (most relevant when exclusion criteria are applied).
- patient_list: list of patients that meet inclusion (and not exclusion, if applicable) criteria.
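The and/or semantics described above amount to set intersection and union over the patients matching each criterion, with exclusion acting as a set difference. The sketch below illustrates this on made-up patient ids; combine_criteria is a hypothetical helper and not ROMOP's internal code.

```r
# Combine per-criterion patient sets:
# "and" -> intersection (all criteria met), "or" -> union (any criterion met).
# Hypothetical helper for illustration; not ROMOP's internal code.
combine_criteria <- function(patient_sets, fn = c("and", "or")) {
  fn <- match.arg(fn)
  if (length(patient_sets) == 0) return(integer(0))
  combiner <- if (fn == "and") intersect else union
  sort(Reduce(combiner, patient_sets))
}

# Made-up patient ids matching two inclusion criteria.
hypertension <- c(1L, 2L, 3L, 5L)
angina       <- c(2L, 5L, 8L)

both   <- combine_criteria(list(hypertension, angina), "and")
either <- combine_criteria(list(hypertension, angina), "or")

# Exclusion criteria remove patients from the included set.
excluded <- c(5L)
final <- setdiff(both, excluded)
```

Here "and" keeps only patients 2 and 5, "or" keeps every patient seen under either criterion, and the exclusion step then drops patient 5.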

Misc.
changeOutDirectory
Description: Sets the current outDirectory, which will store the Data Ontology and all function output. Option to create the directory if it does not exist.
Usage: changeOutDirectory(outdir="path/to/directory", create=FALSE)
Arguments:
outdir: directory path
create: TRUE/FALSE. If TRUE, will create the directory if it does not exist

Value:
Nothing returned; simply sets (and creates, if requested) the output directory.
Details:
• If the directory does not exist and create=FALSE, a warning message will appear and the output directory will not be changed.

Details:
• Generating the Data Ontology takes ~31.2 secs and is ~491.6 Mb.
• If declare == TRUE, the following information will be returned:
codes: concept codes. Semicolon-separated string of code concepts for inclusion criteria, corresponding to the order for vocabulary. Multiple codes can be used per vocabulary and should be comma-separated (see Details).

Value:
Returns a table of concepts contained under (i.e., below in the hierarchy) the query concept.

Details:
• vocabulary input for multiple inputs should use relevant vocabularies (see showDataTypes) as a comma-separated string, e.g., "ATC, ICD10CM".
• codes input corresponds to the order of the vocabulary input and should be a semicolon-separated string in the same order as above. Multiple terms per vocabulary type should be comma-separated, e.g., "A01A; K50, K51" corresponds to "A01A" for ATC and "K50" and "K51" for ICD10CM.
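To make the pairing between the two strings concrete, the sketch below splits them and lines up each vocabulary with its codes. parse_criteria is a hypothetical helper written for illustration and is not ROMOP's own parser.

```r
# Pair each vocabulary with its codes: vocabularies are comma-separated,
# code groups are semicolon-separated, codes within a group comma-separated.
# Hypothetical helper for illustration; not ROMOP's internal parser.
parse_criteria <- function(vocabulary, codes) {
  vocabs <- trimws(strsplit(vocabulary, ",", fixed = TRUE)[[1]])
  groups <- strsplit(codes, ";", fixed = TRUE)[[1]]
  if (length(vocabs) != length(groups)) {
    stop("Number of vocabularies and code groups must match")
  }
  code_list <- lapply(groups, function(g) {
    trimws(strsplit(g, ",", fixed = TRUE)[[1]])
  })
  data.frame(
    vocabulary = rep(vocabs, lengths(code_list)),
    code = unlist(code_list),
    stringsAsFactors = FALSE
  )
}

# The example strings from the text above.
criteria <- parse_criteria("ATC, ICD10CM", "A01A; K50, K51")
```

The result has one row per code, so "A01A" is paired with ATC while "K50" and "K51" are both paired with ICD10CM.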

Examples
Both simple and advanced findPatients queries will be outlined. See the Output section for description of output if save == TRUE. For the process timing provided, all queries were run on an Amazon Elastic Compute Cloud (EC2) instance.

Disease category (ICD10CM): find all "Type 2 Diabetes Mellitus" patients (E11)
Here we will set a single inclusion criterion. The inclusion vocabulary is set to ICD10CM and the inclusion code is E11 corresponding to the vocabulary. Because the inclusion strategy is set as "mapped", ROMOP will map the ICD10CM code to a common ontology (SNOMED) term and find all descendants to search for (see Code Breakdown for details on how this works).
Here we will search for patients that have the specific ICD9CM code 250.11 only, i.e., not map to the common ontology (see Code Breakdown for the importance of this distinction).
query: patient_list = findPatients(strategy_in="direct", vocabulary_in = "ICD9CM", codes_in = "250.11")
time: 1.1 min
3. Multiple diseases (ICD10CM): find all patients with "Essential (primary) hypertension" (I10) and "Angina pectoris with documented spasm" (I20.1)
Here we will search for patients that have multiple ICD10CM codes. While we enter a single inclusion vocabulary, we enter two inclusion codes separated by a comma. We also set the inclusion function to "and", which requires both criteria to be met.
query: patient_list = findPatients(strategy_in="mapped", vocabulary_in = "ICD10CM", codes_in = "I10, I20.1", fu
time: 23.8 secs
4. Drug class (ATC): find all patients prescribed any "Serotonin receptor antagonists" (A03AE)
Here we will search for patients by drug ATC code. As the inclusion strategy is set to "mapped", all drugs that fall into this category will automatically be identified and searched for (see Code Breakdown for details on how this works).
query: patient_list = findPatients(strategy_in="mapped", vocabulary_in = "ATC", codes_in = "A03AE")
time: 1.1 secs
5. Disease category (ICD10CM) but not Drug (MeSH): find all patients with "Other anxiety disorders" (F31), but not prescribed "Clonazepam" (D002998)
Here we will search for patients by ICD10CM code as before. We also identify all patients prescribed the MeSH term for "Clonazepam", who will be removed from the original list.

Output
All output is saved in the output directory (use changeOutDirectory to set). Additionally, the data ontology file will be loaded from here, and saved here if requested through the makeDataOntology function.
If save==TRUE is selected for findPatients queries, various information will be saved in a created query-specific directory within the outDirectory:
• query: all arguments for the search.
• _criteria_mapped: all original criteria for inclusion (and exclusion if applicable) that are mapped to dataOntology.
• criteria_mapped_concepts: all mapped concepts used for inclusion (and exclusion if applicable) that are used to search in clinical data tables. Additionally, the pt_count column displays the number of unique patients that have a record with the corresponding concept.
• outcome: results of the search (most relevant when exclusion criteria are applied).
• patient_list: list of patients that meet inclusion (and not exclusion, if applicable) criteria.
We will detail the respective output files that are derived from Simple Examples #5:
however, facilitates more powerful searches. In most extracted EHR systems, the user has to define all medications to search, for instance through a pre-populated list or by wildcard string matching (e.g., all drug names LIKE "%statin%"). This strategy is ultimately not ideal as it is not extensible to other systems (e.g., one system might prescribe a version or formulation of a drug that is not in another) and requires extensive manual quality control (e.g., removing "nystatin" drugs from the string matching results). For the findPatients function, if the "mapped" option is selected, searching for a broad code like ATC level 3 code A05A (bile therapies), or even a specific term code like RxNorm code 1544460 for idelalisib, will automatically identify and query for all bottom-level codes (e.g., idelalisib 150 MG Delayed Release Oral Tablet) contained underneath that seed concept. This works by ROMOP first mapping the initial search criteria to a standard concept (SNOMED or RxNorm) and finding all descendants underneath it.
Another benefit of this "mapped" option is that terms are not reliant on how the data were originally entered. For instance, if a health system switches from ICD-9CM to ICD-10CM coding, there might be discrepancies in the prevalence of codes over time. Mapping to a common concept, however, often alleviates this issue as codes from both vocabularies are typically linked to a common code in the standard vocabulary. Of course, the user can search only for the concepts they entered by using the "direct" option (i.e., search for ICD-9CM code 230.0 only).
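The contrast between wildcard matching and descendant expansion can be sketched on toy data, as below. This is an illustration under stated assumptions rather than ROMOP's code: the data frames are small stand-ins for the OMOP concept_ancestor and drug_exposure tables, all concept ids are made up, and get_descendants is a hypothetical helper. It relies on the OMOP convention that concept_ancestor lists descendants at every level and includes each concept as its own ancestor.

```r
# Wildcard matching on drug names pulls in unrelated drugs:
drug_names <- c("atorvastatin", "simvastatin", "nystatin")
wildcard_hits <- drug_names[grepl("statin", drug_names, fixed = TRUE)]
# nystatin (an antifungal) is matched alongside the statins.

# Toy stand-in for the OMOP concept_ancestor table. Concept ids are made up.
concept_ancestor <- data.frame(
  ancestor_concept_id   = c(100, 100, 100, 200),
  descendant_concept_id = c(100, 101, 102, 200)
)

# Toy stand-in for the drug_exposure table.
drug_exposure <- data.frame(
  person_id       = c(1, 2, 3, 3),
  drug_concept_id = c(101, 200, 102, 200)
)

# All descendants of the seed concepts (hypothetical helper, not ROMOP code).
get_descendants <- function(seed_ids, ancestor_table) {
  rows <- ancestor_table$ancestor_concept_id %in% seed_ids
  unique(ancestor_table$descendant_concept_id[rows])
}

# Expand seed concept 100 and find patients exposed to any descendant.
descendants <- get_descendants(100, concept_ancestor)
patients <- sort(unique(
  drug_exposure$person_id[drug_exposure$drug_concept_id %in% descendants]
))
```

In this toy case the seed concept expands to three concepts, and only the patients with an exposure to one of them are returned, with no name matching involved.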

Helpful Hints
• We recommend using the mapped argument for the findPatients function because the concepts will not depend on the format in which the data were entered (i.e., the source_concept). This is important as different institutions may utilize different underlying terminologies, as well as switch primary data entry vocabularies over time (e.g., the switch from ICD-9 to ICD-10). For example, if the user is interested in "Trigeminal neuralgia" and uses the ICD-10 code "G50.1" with the direct argument, all prior entries that utilized the corresponding ICD-9 code ("350.1") most likely will not be found, as many data warehouses do not "back-map" codes. Using the mapped argument will bypass this issue as the standard concept will be used, which should capture both options.
• Standard vocabularies: while the OMOP common data model utilizes many ontologies, SNOMED and RxNorm are used primarily for common concepts in the clinical data tables. As such, while any vocabulary can be used for findPatients, the mapped function will only be able to find data contained within the following common concepts per domain: Consequently, if inclusion/exclusion criteria can be mapped to the data ontology, but no synonym/descendants are contained within the above common concepts, no search will be performed (as no patients would be returned). This most directly affects searching for Drug concepts, in which we recommend not using standard common concepts (e.g., RxNorm, ATC) for search criteria.
• To ensure complete capture of data concepts of interest, we recommend identifying multiple vocabularies/codes to use with the Athena resource. For instance, if interested in finding all individuals taking a Benzodiazepine, consider using both the relevant ATC classes (e.g., N03AE) and the relevant Substance (SNOMED) codes (e.g., 16047007). The exploreConcepts function can be used to identify and prioritize which codes are optimal to use.
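The first hint can be illustrated with a toy table spanning a coding switch. This is a simplified sketch, not ROMOP's query logic: the source codes are those quoted above, the standard concept id 9001 is made up, and the assumption that both source codes map to the same standard concept is for illustration only. For readability the direct search here matches on the source code string, whereas the text describes the direct strategy as using the _source_concept fields.

```r
# Toy condition records spanning an ICD-9 to ICD-10 switch.
# Concept id 9001 is made up; both source codes are ASSUMED to map to it.
condition_occurrence <- data.frame(
  person_id              = c(1, 2, 3),
  condition_source_value = c("350.1", "G50.1", "G50.1"),
  condition_concept_id   = c(9001, 9001, 9001),
  stringsAsFactors = FALSE
)

# "direct"-style search: match only the source code that was entered.
direct <- with(condition_occurrence,
               unique(person_id[condition_source_value == "G50.1"]))

# "mapped"-style search: match on the shared standard concept instead.
mapped <- with(condition_occurrence,
               unique(person_id[condition_concept_id == 9001]))
```

The direct-style search misses patient 1, whose record predates the switch, while the mapped-style search returns all three patients.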

Copyright (c) 2018 Benjamin S. Glicksberg
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Contact
For questions, comments, errors, bug reports, or issues, please contact: benjamin.glicksberg@ucsf.edu
For general correspondence, please contact: atul.butte@ucsf.edu