ricu: R’s interface to intensive care data

Abstract
Objective: To develop a unified framework for analyzing data from 5 large publicly available intensive care unit (ICU) datasets.
Findings: Using 3 American (Medical Information Mart for Intensive Care III, Medical Information Mart for Intensive Care IV, electronic ICU) and 2 European (Amsterdam University Medical Center Database, High Time Resolution ICU Dataset) databases, we constructed a mapping for each database to a set of clinically relevant concepts, which are grounded in the Observational Medical Outcomes Partnership Vocabulary wherever possible. Furthermore, we synchronized units of measurement and data type representation. On top of this, we built functionality that allows the user to download, set up, and load data from all 5 databases through a unified Application Programming Interface. The resulting ricu R-package represents the computational infrastructure for handling publicly available ICU datasets, and its latest release allows the user to load 119 existing clinical concepts from the 5 data sources.
Conclusion: The ricu R-package (available on GitHub and CRAN) is the first tool that enables users to analyze publicly available ICU datasets simultaneously (datasets are available upon request from respective owners). Such an interface saves researchers time when analyzing ICU data and helps reproducibility. We hope that ricu can become a community-wide effort, so that data harmonization is not repeated by each research group separately. One current limitation is that concepts were added on a case-by-case basis, and therefore the resulting dictionary of concepts is not comprehensive. Further work is needed to make the dictionary comprehensive.


Introduction
Collection of electronic health records has seen a significant rise in recent years (Evans 2016), opening up opportunities and providing the grounds for a large body of data-driven research oriented towards helping clinicians in decision-making and therefore improving patient care and health outcomes (Jiang, Jiang, Zhi, Dong, Li, Ma, Wang, Dong, Shen, and Wang 2017).
One example of a problem that has received much attention from the machine learning community is early prediction of sepsis in the ICU (Desautels, Calvert, Hoffman, Jay, Kerem, Shieh, Shimabukuro, Chettipally, Feldman, Barton et al. 2016; Nemati, Holder, Razmi, Stanley, Clifford, and Buchman 2018; Futoma, Hariharan, Sendak, Brajer, Clement, Bedoya, O'Brien, and Heller 2017; Kam and Kim 2017). Interestingly, there is evidence that a large proportion of the publications are based on the same dataset (Fleuren, Klausch, Zwager, Schoonmade, Guo, Roggeveen, Swart, Girbes, Thoral, Ercole, Hoogendoorn, and Elbers 2020), the Medical Information Mart for Intensive Care III (MIMIC-III; Johnson, Pollard, Shen, Li-wei, Feng, Ghassemi, Moody, Szolovits, Celi, and Mark 2016), which shows a systematic lack of external validation. Part of this problem might well be the need for computational infrastructure handling multiple datasets. The MIMIC-III dataset consists of 26 different tables containing about 20 GB of data. While much work and care has gone into data preprocessing in order to provide a self-contained ready-to-use data resource with MIMIC-III, seemingly simple tasks such as computing a sepsis-related organ failure assessment (SOFA) score (Vincent, Moreno, Takala, Willatts, De Mendonça, Bruining, Reinhart, Suter, and Thijs 1996) remain a non-trivial effort¹. This is only exacerbated when aiming to co-integrate multiple different datasets of this form, spanning hospitals and even countries, in order to capture effects of differing practice and demographics.
The aim of the ricu package is to provide computational infrastructure allowing users to investigate complex research questions in the context of critical care medicine as easily as possible by providing a unified interface to a heterogeneous set of data sources. The package enables users to write dataset-agnostic code which can simplify implementation and shorten the time necessary for prototyping code querying different datasets. In its current form, the package handles four large-scale, publicly available intensive care databases out of the box: MIMIC-III from the Beth Israel Deaconess Medical Center in Boston, Massachusetts (Johnson et al. 2016), the eICU Collaborative Research Database (Pollard, Johnson, Raffa, Celi, Mark, and Badawi 2018), containing data collected from 208 hospitals across the United States, the High Time Resolution ICU Dataset (HiRID) from the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland (Faltys, Zimmermann, Lyu, Hüser, Hyland, Rätsch, and Merz 2021) and AmsterdamUMCdb from the Amsterdam University Medical Center (Thoral, Peppink, Driessen, Sijbrands, Kompanje, Kaplan, Bailey, Kesecioglu, Cecconi, Churpek, Clermont, van der Schaar, Ercole, Girbes, Elbers, and the SCCM/ESICM Joint Data Science Task Force 2021). Furthermore, ricu was designed with extensibility in mind such that adding further public and/or private user-provided datasets is possible. Being implemented in R, a programming language popular among statisticians and data analysts, it is our hope to contribute to accessible and reproducible research by using a familiar environment and requiring only a few system dependencies, thereby simplifying setup considerably.
To our knowledge, infrastructure that provides a common interface to multiple such datasets is a novel contribution. While there have been efforts (Adibuzzaman, Musselman, Johnson, Brown, Pitluk, and Grama 2016; Wang, McDermott, Chauhan, Ghassemi, Hughes, and Naumann 2020) attempting to abstract away some specifics of a dataset, these have so far exclusively focused on MIMIC-III, the most popular of the public ICU datasets, and have not been designed with dataset interoperability in mind.
Given the somewhat narrow focus of the targeted datasets, combined with the fact that in some cases data is even extracted from identical patient care management systems, it may come as a surprise how heterogeneous the resulting datasets are. In MIMIC-III and HiRID, for example, time-stamps are reported as absolute times (albeit randomly shifted due to data privacy concerns), whereas eICU and AUMC use relative times (with origins being admission times). Another example involves different types of patient identifiers and their use among datasets. Common to all is the notion of an ICU admission ID, but apart from that, the amount of available information varies: While ICU (and hospital) readmissions for a given patient can be identified in some, this is not possible in other datasets. Furthermore, use of identifier systems might not be consistent over tables. In MIMIC-III, for example, some tables refer to ICU stay IDs while others use hospital stay IDs, which slightly complicates data retrieval for a given ID system. Additionally, table layouts vary (long versus wide data arrangement) and data organization in general is far from consistent over datasets.

Quick start guide
The following list gives a quick outline of the steps required for setting up and starting to use ricu, alongside some section references on where to find further details. A more comprehensive version of this overview is available as a separate vignette.

1. Package installation:
• the latest release of ricu can be installed from CRAN as install.packages("ricu")
• alternatively, the latest development version is available from GitHub by running remotes::install_github("eth-mds/ricu")
2. Requesting access to datasets and data source setup:
• demo datasets can be set up by installing the data packages mimic.demo and/or eicu.demo, passing "https://eth-mds.github.io/physionet-demo" as repos argument to install.packages()
• the complete MIMIC-III, eICU and HiRID datasets can be accessed by setting up an account at PhysioNet
• access to AUMCdb is available via the Amsterdam Medical Data Science Website
• the obtained credentials can be configured for PhysioNet datasets by setting environment variables RICU_PHYSIONET_USER and RICU_PHYSIONET_PASS, while the download token for AUMCdb can be set as RICU_AUMC_TOKEN
• datasets are downloaded and set up either automatically upon the first access attempt or manually by running setup_src_data(); the environment variable RICU_DATA_PATH can be set to control data location
• dataset availability can be queried by calling src_data_avail()
A more detailed description of the datasets and the setup process is given in Section 3, with Section 3.1 providing an overview of each of the 4 supported datasets and Section 3.2 elaborating on how datasets are represented in code.
3. Loading of data corresponding to clinical concepts using load_concepts():
• currently, over 100 data concepts are available for the 4 supported datasets (see concept_availability()/explain_dictionary() for names, availability etc.)
• both concepts and data sources can be specified as strings; for example, glucose and age data can be loaded from MIMIC-III as load_concepts(c("age", "glu"), "mimic")
Section 4 goes into more detail on how data concepts are represented within ricu and an overview of the pre-configured concepts is available from Section 4.3.
4. Extending the concept dictionary:
• data concepts can be specified in code using the constructors concept()/item() or new_concept()/new_item()
• for session persistence, data concepts can also be specified as JSON formatted objects
• JSON-based concept dictionaries can either extend or replace others and they can be pointed to by setting the environment variable RICU_CONFIG_PATH
The JSON format used to encode data concepts is discussed in more detail in Section 4.1.
5. Adding new datasets:
• a JSON-based dataset configuration file is required, from which the configuration objects described in Section 3.2 are created
• in order for concepts to be available from the new dataset, the dictionary requires extension by adding new data items
Some further information about adding a custom dataset is available from Section 3.3, albeit not in much detail. Some code used when AUMCdb was not yet fully integrated with ricu is available from GitHub.
The final section (5) shows briefly how ricu could be used in practice to address clinical questions by presenting two small examples.

Data sources
In order to make data available from different data sources, ricu provides abstractions using JSON-formatted configuration files and a set of S3 classes with associated S3 generic functions. This system is designed with extensibility in mind, allowing for incorporation of a wide variety of datasets. Provisions for several large-scale publicly available datasets in terms of required configuration information alongside class-specific implementations of the needed S3 generic functions are part of ricu, opening up access to these datasets. Data itself, however, is not part of ricu but rather can be downloaded from the Internet using tools provided by ricu. While the datasets are publicly available, access has to be granted by the dataset creators individually. Three datasets, MIMIC-III, eICU and HiRID, are hosted on PhysioNet (Goldberger, Amaral, Glass, Hausdorff, Ivanov, Mark, Mietus, Moody, Peng, and Stanley 2000), access to which requires an account, while the fourth, AmsterdamUMCdb, is currently distributed via a separate platform, requiring a download link.
For both MIMIC-III and eICU, small subsets of data are available as demo datasets that do not require credentialed access to PhysioNet. As the terms for distribution of these demo datasets are less restrictive, they can be made available as data packages mimic.demo and eicu.demo. Due to size constraints, however, they are not available via CRAN, but can be installed from GitHub as
R> install.packages(
+   c("mimic.demo", "eicu.demo"),
+   repos = "https://eth-mds.github.io/physionet-demo"
+ )
Provisions for datasets configured to be attached during package loading are made irrespective of whether data is actually available. Upon access of an incomplete dataset, the user is asked for permission to download in interactive sessions and an error is thrown otherwise. Credentials can be provided as environment variables (RICU_PHYSIONET_USER and RICU_PHYSIONET_PASS for access to PhysioNet data, as well as RICU_AUMC_TOKEN for AmsterdamUMCdb); if the corresponding variables are unset, user input is again required in interactive sessions. For non-interactive sessions, functionality is exported such that data can be downloaded and set up ahead of first access (see ?setup_src_data).

Ready to use datasets
Contingent on being granted access by the data owners, several large-scale ICU datasets collected from multiple hospitals in the US and Europe can be set up for access using ricu with minimal user effort. Download requires a stable Internet connection, as well as 50 to 100 GB of temporary disk storage for unpacking and preparing the data for efficient access. In terms of permanent storage, 5 to 10 GB per dataset are required, while memory requirements permit importing (and working with) even the largest tables using laptop-class hardware, as only subsets of rows are read at once.
The following paragraphs serve to give quick introductions to the included datasets to offer some guidance, outlining some strengths and weaknesses of each of the datasets. Especially the PhysioNet datasets MIMIC-III and eICU offer good documentation on the respective websites. This section is concluded with a table summarizing similarities and differences among the datasets, outlined in the following paragraphs (see Table 1).

MIMIC-III
The Medical Information Mart for Intensive Care III (MIMIC-III) represents the third iteration of arguably the most influential initiative for collecting large-scale ICU data and providing it to the public². The dataset comprises de-identified health related data of roughly 46,000 patients admitted to critical care units of BIDMC during the years 2001-2012. Amounting to just over 61,000 individual ICU admissions, data is available on demographics, routine vital sign measurements (at approximately 1 hour resolution), laboratory tests, medication, as well as critical care procedures, organized as a 26-table relational structure.

R> mimic
² The initial MIMIC (at the time short for Multi-parameter Intelligent Monitoring for Intensive Care) data release dates back 20 years and contained data on roughly 100 patients recorded from patient monitors in the medical, surgical, and cardiac intensive care units of Boston's Beth Israel Hospital during the years 1992-1999 (Moody and Mark 1996). Significantly broadened in scope, MIMIC-II was released 10 years after, now including data on almost 27,000 adult hospital admissions collected from ICUs of Beth Israel Deaconess Medical Center (BIDMC) from 2001 to 2008 (Lee, Scott, Villarroel, Clifford, Saeed, and Mark 2011). Following MIMIC-III, release of MIMIC-IV is imminent with a first development version having been released in summer 2020. This iteration of MIMIC too is planned to be included with ricu as soon as a first stable version is released.

eICU
The data is organized into 31 tables and includes patient demographics, routine vital signs, laboratory measurements, medication administrations, admission diagnoses, as well as treatment information. Owing to the wide range of hospitals participating in this data collection initiative, spanning small, rural, non-teaching health centers with fewer than 100 beds to large teaching hospitals with an excess of 500 beds, data availability varies. Even if data was being recorded at the bedside, it might end up missing from the eICU dataset due to technical limitations of the collection process. As for patient identifiers, while it is possible to link ICU admissions corresponding to the same hospital stay, it is not possible to identify patients across hospital stays.
Data resolution again varies considerably over included variables. The vitalperiodic table stands out as one of the few examples of a wide table organization (laying out variables as columns), as opposed to the long presentation (following an entity-attribute-value model) of most other tables containing patient measurement data. The average time step in vitalperiodic is around 5 minutes, but data missingness ranges from around 1% for heart rate and pulse oximetry to around 80-90% for blood pressure measurements, therefore giving approximately hourly resolution for such variables.

HiRID
Developed for early prediction of circulatory failure (Hyland, Faltys, Hüser, Lyu, Gumbsch, Esteban, Bock, Horn, Moor, Rieck, Zimmermann, Bodenham, Borgwardt, Rätsch, and Merz 2020), the High Time Resolution ICU Dataset (HiRID) contains data on almost 34,000 admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland, an interdisciplinary 60-bed unit. Given the clear focus on a concrete application during data collection, this dataset is the most limited in terms of breadth of available information, which is also reflected in a comparatively simple data layout comprising only 5 tables.
[Printed summary of the HiRID data source; surviving table dimensions: 33,905 x 5; 776,921,131 x 8; 72 x 3; 16,270,399 x 14; variables: 712 x 5]
Collected during the period of January 2008 through June 2016, roughly 700 distinct variables covering routine vital signs, diagnostic test results and treatment parameters are available, with variables monitored at the bedside being recorded with two minute time resolution. In terms of demographic information and patient identifier systems, however, the data is limited. It is not possible to identify ICU admissions corresponding to individual patients and, apart from patient age, sex, weight and height, very little information is available to characterize patients. There is no medical history, no admission diagnoses, only in-ICU mortality information, no unstructured patient data and no information on patient discharge. Furthermore, data on body fluid sampling has been omitted, complicating for example the construction of a Sepsis-3 label (Singer et al. 2016).

AmsterdamUMCdb
As a second European dataset, also focusing on increased time-resolution over the US datasets, AmsterdamUMCdb was made available in late 2019, containing data on over 23,000 intensive care unit and high dependency unit admissions of adult patients during the years 2003 through 2016. The department of Intensive Care at Amsterdam University Medical Center is a mixed medical-surgical ICU with up to 32 ICU beds and 12 high dependency unit beds and an average of 1000-2000 yearly admissions. Covering middle ground between the US datasets and HiRID in terms of breadth of included data, while providing a maximal time-resolution of 1 minute, AmsterdamUMCdb constitutes a well organized high quality ICU data resource organized succinctly as a 7-table relational structure.
A slightly different approach to data anonymization was chosen for this dataset, making demographic information such as patient weight, height and age available only as binned variables instead of raw numeric values. Apart from this, there is information on patient origin, mortality, admission diagnoses, as well as numerical measurements including vital parameters, lab results, outputs from drains and catheters, information on administered medication, and other medical procedures. In terms of patient identifiers, it is possible to link ICU admissions corresponding to the same individual, but it is not possible to identify separate hospital admissions.

Implementation details
Every dataset is represented by an environment with class attributes and associated metadata objects stored as object attributes to that environment. Dataset environments all inherit from src_env and from any number of class names constructed from data source name(s) with a suffix _env attached. The environment representing MIMIC-III, for example, inherits from src_env and mimic_env, while the corresponding demo dataset inherits from src_env, mimic_env and mimic_demo_env. These sub-classes are later used for tailoring the process of data loading to particularities of individual datasets.
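The class layering described above can be illustrated with a minimal base R sketch. This is not ricu's implementation: the constructor new_demo_env() and the generic describe_src() are hypothetical names introduced only for illustration, and the ordering of the class vector (most specific class first) is an assumption following usual S3 conventions.

```r
# Hypothetical stand-in for a data source environment; the class vector
# lists the most specific class first so that S3 dispatch tries it first.
new_demo_env <- function() {
  structure(new.env(), class = c("mimic_demo_env", "mimic_env", "src_env"))
}

# Hypothetical generic (not part of ricu) used to show the dispatch order.
describe_src <- function(x, ...) UseMethod("describe_src")

# Fallback shared by all data sources.
describe_src.src_env <- function(x, ...) "generic data source"

# Dataset-specific method that adds behavior and then defers to the fallback.
describe_src.mimic_env <- function(x, ...) {
  paste("MIMIC-specific handling, then", NextMethod())
}

env <- new_demo_env()
res <- describe_src(env)
print(res)
```

As no method exists for mimic_demo_env in this sketch, dispatch falls through to the mimic_env method, which is how a demo dataset can reuse the behavior defined for its parent dataset.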
A src_env contains an active binding per contained [...]
44228 103379 2170-12-15 03:14:00 2170-12-24 18:00:00
By using rlang pronouns (.data and .env), the distinction can readily be made between a name referring to an object within the context of the data and an object within the context of the calling environment.

Data source set-up
In order to make a dataset accessible to ricu, three steps are necessary, each handled by an exported S3 generic function: download_src(), import_src() and attach_src(). The first two steps, download and import, are one-time procedures, whereas attaching is carried out every time the package namespace is loaded. By default, all data sources known to ricu are configured to be attached and in case some data is missing for a given data source, the missing data is downloaded and imported on first access. For data download, several environment variables can be configured:
• RICU_PHYSIONET_USER/RICU_PHYSIONET_PASS: PhysioNet user name and password with access to the requested dataset.
• RICU_AUMC_TOKEN: Download token, extracted from the download URL received when requesting data access.
If any of the required access credentials are not available as environment variables, the user is queried in interactive sessions. Each of the datasets requires 5-10 GB disk space for permanent storage. Additionally, 50-100 GB of temporary disk storage is required during download and import of any of the datasets. Memory requirements are kept low by performing all set-up operations only on subsets of rows at a time, such that 8 GB of memory should suffice. Initial data source set-up (depending on available download speeds and CPU/disk type) may take upwards of an hour per dataset.
Further environment variables can be set to customize certain aspects of ricu data handling:
• RICU_DATA_PATH: Data storage location (can be queried by calling data_dir()).
• RICU_CONFIG_PATH: Comma-separated paths to directories containing configuration files (in addition to the default location; retrievable using config_paths()).
• RICU_SRC_LOAD: Comma-separated data source names that are set up for being automatically attached on namespace loading (the current set of data sources is available as auto_attach_srcs()).
After successful data download, importing prepares tables for efficient random row-access, for which the raw data format (.csv) is not well suited. Tables are read in using readr (Wickham and Hester 2020), potentially (re-)partitioned row-wise, and re-saved using fst.
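As a brief sketch of how these variables might be set from within an R session, the following uses only base R. All values are placeholders, and setting the variables before ricu is loaded is an assumption derived from the statement that attaching happens on namespace loading; an entry in a user's .Renviron file would be an alternative.

```r
# Placeholder values; replace with real credentials and paths.
Sys.setenv(
  RICU_PHYSIONET_USER = "my-user",
  RICU_PHYSIONET_PASS = "my-password",
  RICU_DATA_PATH = file.path(tempdir(), "ricu-data"),
  RICU_SRC_LOAD = "mimic_demo,eicu_demo"
)

# Comma-separated variables can be split into individual entries.
srcs <- strsplit(Sys.getenv("RICU_SRC_LOAD"), ",", fixed = TRUE)[[1]]
print(srcs)

# An unset variable is returned as an empty string, which can serve as a
# check for whether credentials would have to be entered interactively.
has_token <- nzchar(Sys.getenv("RICU_AUMC_TOKEN"))
```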
Finally, attaching a dataset creates a corresponding src_env object which together with associated meta-data is used by ricu to run queries against the data.

Data loading
The lowest level of data access is direct subsetting of src_tbl objects as shown at the start of this section (3.2). Building on that, several S3 generic functions successively homogenize data representations, starting with load_src(), which provides a string-based interface to subset() for all but the row-subsetting expression.
Building on load_difftime() functionality, load_id() (and analogously load_ts()) returns an id_tbl (or ts_tbl) object with the requested ID system (passed as id_var argument). This uses raw data IDs if available or calls change_id() in order to convert to the desired ID system. Similarly, where load_difftime() returns data with a fixed time interval of one minute, load_id() allows for arbitrary time intervals (using change_interval()).
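What moving from one-minute resolution to a coarser time interval involves can be sketched in base R as follows. This is an illustration of the idea only and not the implementation of change_interval(); the toy data, the choice of flooring to full hours and the use of median() for collapsing rows are assumptions.

```r
# Toy measurements indexed by time since admission at one-minute resolution.
dat <- data.frame(
  icustay_id = c(1, 1, 1, 2),
  time = as.difftime(c(5, 59, 61, 130), units = "mins"),
  hr = c(80, 84, 90, 70)
)

# Re-express the time index on an hourly grid by flooring to full hours.
dat$hour <- floor(as.numeric(dat$time, units = "mins") / 60)

# Several rows may now share a time step, so collapse them per ID and hour.
hourly <- aggregate(hr ~ icustay_id + hour, data = dat, FUN = median)
print(hourly)
```

The first two rows fall into the same hour for the same ICU stay and are therefore collapsed into a single value, which is why a coarser interval generally calls for an aggregation step.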

Data source configuration
Data source environments (and corresponding src_tbl objects) are constructed using source configuration objects: list-based structures, inheriting from src_cfg and from any number of data source-specific class names with suffix _cfg appended (as discussed at the beginning of Section 3.2). The exported function load_src_cfg() reads a JSON formatted file using jsonlite (Ooms 2014), and creates a src_cfg object per data source, together with the objects contained therein.

ID configuration
An id_cfg object contains an ordered set of key-value pairs representing patient ID systems in a dataset. An implicit assumption currently is that a given patient ID system is used consistently throughout a dataset, meaning that for example an ICU stay ID is always referred to by the same name throughout all tables containing a corresponding column. Owing to the relational origins of these datasets, this has been fulfilled in all instances encountered so far. In MIMIC-III, ID systems are available, allowing for identification of individual patients, their (potentially multiple) hospital admissions over the course of the years and their corresponding ICU admissions (as well as potential re-admissions). Ordering corresponds to cardinality: moving to larger values implies moving along a one-to-many relationship. This information is used in data-loading, whenever the target ID system is not contained in the raw data.
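The role of the ordering among ID systems can be illustrated with a small base R sketch. The mapping table and measurement values below are toy data, and the plain join is a simplification for illustration rather than a description of what change_id() does internally.

```r
# Toy mapping between ID systems, ordered patient < hospital stay < ICU stay:
# one patient may have several hospital admissions, each with several ICU stays.
id_map <- data.frame(
  subject_id = c(1, 1, 1, 2),
  hadm_id    = c(10, 10, 11, 20),
  icustay_id = c(100, 101, 102, 200)
)

# Toy data recorded per ICU stay.
obs <- data.frame(
  icustay_id = c(100, 101, 102, 200),
  lact = c(1.2, 2.5, 0.9, 3.1)
)

# Moving to a coarser ID system follows the many-to-one direction, so a
# join on the finer ID is sufficient to attach the hospital stay ID.
by_hadm <- merge(obs, id_map[, c("icustay_id", "hadm_id")], by = "icustay_id")
print(by_hadm)
```

Moving in the opposite direction (from hospital stays to ICU stays) is not covered by such a join alone, as one hospital stay may map to several ICU stays and additional information is needed to assign rows.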

Default column configuration
Again used in data loading, this per-table set of key-value pairs specifies column defaults as a col_cfg object. Each key describes a type of column with special meaning and the corresponding value specifies said column for a given table.

R> as_col_cfg(mi_cfg)
<col_cfg<mimic_demo[id_var, index_var, time_vars, unit_var, val_var]>
The following column defaults are currently in use throughout ricu but the set of keys can be extended to arbitrary new values:
• id_var: In case a table does not contain at least one ID column corresponding to one of the ID systems specified as id_cfg, the default ID column can be set on a per-table basis as id_var⁴.
• index_var: A column that is used to define an ordering in time over rows, thereby providing a time-series index⁵.
• time_vars: Columns which will be treated as time variables (important for converting between ID systems for example), but not as time-series indices⁶.
• unit_var: Used in concept loading (more specifically for num_cncpt concepts, see Section 4.1) to identify columns that represent unit of measurement information.
• val_var: Again used when loading data concepts, this identifies a default value variable in a table, representing the column of interest to be used as returned data column.
While id_var, index_var and time_vars are used to provide sensible defaults to functions used for general data loading (Section 3.2.2), unit_var, val_var, as well as potential user-defined defaults are only used in concept loading (see Section 4.3) and therefore need not be prioritized when integrating new data sources until data concepts have been mapped.

Table configuration
Finally, tbl_cfg objects are used during the initial set-up of a data source. In order to create a representation of a table that is accessible from ricu from raw data, several key pieces of information are required:
• File name(s): In the simplest case, a single file corresponds to a single table. Other scenarios that have been encountered (and are therefore handled) include tables partitioned into multiple files and .tar archives containing multiple tables.
• Column specification: For each column, the expected data type has to be known, as well as a pair of names, one corresponding to the raw data column name and one corresponding to the column name to be used within ricu.
• (Optional) number of rows: Used as sanity check whenever available.
• (Optional) partitioning information: For very long tables it can be useful to specify a row-partitioning. This currently is only possible by applying a vector of breakpoints to a single numeric column, thereby defining a grouping.

R> as_tbl_cfg(mi_cfg)
<tbl_cfg<mimic_demo
For the chartevents table of the MIMIC-III demo dataset, for example, rows are partitioned into two groups, while all other tables are represented by a single partition. Furthermore, the expected number of rows is unknown (??) as this is missing from the corresponding tbl_cfg object.
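Row-partitioning by breakpoints on a single numeric column can be sketched in base R using findInterval(). The breakpoint and the itemid values below are hypothetical and chosen only to yield two groups; this is not the code ricu uses during import.

```r
# Hypothetical single breakpoint applied to a numeric column (here: itemid),
# yielding two partitions.
breaks <- 100000
itemid <- c(211, 51, 1529, 220045, 225908, 999)

# findInterval() returns 0 for values below the breakpoint and 1 for values
# at or above it; adding 1 gives partition numbers starting at 1.
part <- findInterval(itemid, breaks) + 1L

# Rows can then be split into one chunk per partition.
chunks <- split(itemid, part)
print(chunks)
```

A vector of k sorted breakpoints defines k + 1 groups in the same way, so longer tables can be divided into more partitions by supplying further breakpoints.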

Adding external datasets
In order to add a new dataset to ricu, several aspects outlined in the previous subsections require consideration. For illustration purposes, code for integrating AmsterdamUMCdb as an external dataset is available from GitHub. While this is no longer needed for using the aumc data source, the repository will remain as it might serve as a template for the integration of new datasets.
Using a configuration file as described in Section 3.2.3 and pointing the RICU_CONFIG_PATH environment variable at its location, data can be prepared for use with ricu using import_src() and made available to ricu using attach_src(). In addition to providing a configuration file, dataset-specific implementations of some of the S3 generic functions involved in data-loading might be required. In the case of AmsterdamUMCdb, a class-specific implementation of load_difftime() is required, as raw time-stamps are recorded in milliseconds (instead of minutes), as well as a class-specific implementation of the S3 generic function id_win_helper()⁷.
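The unit conversion that motivates a dataset-specific load_difftime() method can be sketched in base R. The helper as_minutes() and the raw values are hypothetical and serve only to show the arithmetic; the actual method in ricu may differ, for example in how fractional minutes are handled.

```r
# Hypothetical raw offsets from admission, recorded in milliseconds.
raw_ms <- c(0, 90000, 3600000)

# Convert to a difftime in minutes (60,000 ms per minute).
as_minutes <- function(ms) as.difftime(ms / 60000, units = "mins")

offsets <- as_minutes(raw_ms)
print(offsets)
```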

Data concepts
One of the key components of ricu is a scheme for specifying how to retrieve data corresponding to pre-defined clinical concepts from a given data source, in turn enabling dataset-agnostic code for analysis. Heart rate, for example, can be loaded using the hr concept as
R> load_concepts("hr", c("mimic_demo", "eicu_demo"), verbose = FALSE)
This requires some form of infrastructure for concisely specifying how to retrieve data subsets (Section 4.1), which is both extensible (to new concepts and new datasets) and flexible enough to handle concept-specific pre-processing. Additionally, ricu includes a dictionary with over 100 concepts implemented for all four supported datasets (where possible; see also Section 4.3). A quick remark on terminology before diving into more details on how to specify data concepts: A concept corresponds to a clinical variable such as a bilirubin measurement or the ventilation status of a patient, and an item encodes how to retrieve data corresponding to a given concept from a data source. A concept therefore contains several items (zero, one or several are possible per data source).

Concept specification
⁷ This method is used as part of change_id() which is called whenever the requested ID system is not available in raw data. A component central to conversion between patient ID systems is a table which contains patient IDs as columns alongside columns with start and end-points for each, thereby specifying a mapping between ID systems. Dataset-specific construction of such a table is handled by the S3 generic function id_win_helper(). As construction of such tables can be expensive (involving several merge operations of tables with 10^4 to 10^5 rows) and the tables are used frequently (potentially with every single data request), the resulting table is cached in memory with session persistence.
Similarly to data source configuration (discussed in Section 3.2.3), concept specification relies on JSON-formatted text files. A default dictionary of concepts is included with ricu containing a selection of commonly used clinical concepts. Several types of concepts exist within ricu and, with extensibility in mind, new types can easily be added.
All concepts consist of minimal meta-data including a name, target class (defaults to ts_tbl; see Section 4.2), an aggregation specification 8 and class information (defaults to num_cncpt), as well as optional description and category information. Depending on concept class, further fields can be added. In the case of the most widespread concept type (num_cncpt; used to represent numeric data) these are unit, which encodes one (or several synonymous) unit(s) of measurement, as well as minimal and maximal plausible values (specified as min and max). The concept for heart rate data (hr), for example, can be specified as

    {
      "hr": {
        "unit": ["bpm", "/min"],
        "min": 0,
        "max": 300,
        "description": "heart rate",
        "category": "routine vital signs",
        "sources": { ... }
      }
    }

Meta-data is used during concept loading for data preprocessing. For numeric concepts, the specified measurement unit is compared to that of the data (if available), with messages being displayed in case of mismatches, while the range of plausible values is used to filter out measurements that fall outside the specified interval. Other types of concepts include categorical concepts (fct_cncpt), concepts representing binary data (lgl_cncpt), as well as recursive concepts (rec_cncpt), which build on other atomic concepts 9. Specification of how data can be retrieved from a data source is encoded by data items. Lists of data items (associated with data source names) are provided as sources element (instead of ... in the above code block).

8 Every concept needs a default aggregation method which can be used during data loading to return data that is unique per key (either per id_vars group or per combination of id_vars and index_var); otherwise, down-stream merging of multiple concepts is ill-defined. The aggregation default can be overridden during loading or as specification of a rec_cncpt object. If no aggregation method is explicitly indicated, the global default is first() for character, median() for numeric and sum() for logical vectors. For logical data, if a concept of type lgl_cncpt is used, the count of TRUE values is converted back to logical, thereby providing any() type functionality.

9 An example for a recursive concept is the PaO2/FiO2 ratio, used for instance to assess patients with acute respiratory distress syndrome (ARDS) or for sepsis-related organ failure assessment (SOFA) (Villar, Pérez-Méndez, Blanco, Añón, Blanch, Belda, Santos-Bouza, Fernández, Kacmarek, and Spanish Initiative for Epidemiology and Therapies for ARDS (SIESTA) Network 2013; Vincent et al. 1996). Given both PaO2 and FiO2 as individual concepts, the PaO2/FiO2 ratio is provided by ricu as a recursive concept (pafi), requesting the two atomic concepts pao2 and fio2 and performing some form of imputation for when at a given time step one or both values are missing.

For the demo datasets corresponding to eICU and MIMIC-III, heart rate data retrieval is specified as

    {
      "eicu_demo": [
        {
          "table": "vitalperiodic",
          "val_var": "heartrate",
          "class": "col_itm"
        }
      ],
      "mimic_demo": [
        {
          "ids": [211, 220045],
          "table": "chartevents",
          "sub_var": "itemid"
        }
      ]
    }

Analogously to how different types of concepts are used to represent different types of data, different types of items handle different types of data loading. The most common scenario is selecting a subset of rows from a table by matching a set of ID values (sub_itm). In the above example, heart rate data in MIMIC-III can be located by searching for ID values 211 and 220045 in column itemid of table chartevents (heart rate data is stored in long format). Conversely, heart rate data in eICU is stored in wide format, requiring no row-subsetting. Column heartrate of table vitalperiodic contains all corresponding data, and such data situations are handled by the col_itm class. Other item classes include rgx_itm, where a regular expression is used for selecting rows, and fun_itm, where an arbitrary function can be used for data loading. If a data loading scenario is not covered by these classes, adding further itm subclasses is encouraged.
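The difference between row-subsetting items (sub_itm) and column-selecting items (col_itm) can be illustrated with a minimal base R sketch. This is not how ricu implements item loading: the helper names load_by_ids() and load_by_col(), the toy tables and all values are assumptions made purely for illustration.

```r
# Toy stand-ins for a long-format and a wide-format source table
# (all values, including the unrelated item ID 9999, are made up).
chartevents <- data.frame(
  icustay_id = c(1, 1, 2, 2),
  itemid     = c(211, 9999, 220045, 211),
  valuenum   = c(80, 18, 95, 102)
)
vitalperiodic <- data.frame(
  patientunitstayid = c(10, 10, 11),
  heartrate         = c(72, 75, 110),
  respiration       = c(16, 17, 22)
)

# sub_itm-like loading (hypothetical helper): keep rows whose sub_var
# matches one of the requested ID values.
load_by_ids <- function(tbl, sub_var, ids) {
  tbl[tbl[[sub_var]] %in% ids, , drop = FALSE]
}

# col_itm-like loading (hypothetical helper): no row-subsetting, only the
# ID column and a single value column are kept.
load_by_col <- function(tbl, id_var, val_var) {
  tbl[, c(id_var, val_var), drop = FALSE]
}

hr_mimic <- load_by_ids(chartevents, "itemid", c(211, 220045))
hr_eicu  <- load_by_col(vitalperiodic, "patientunitstayid", "heartrate")
```

In the long-format case the item is identified by its rows, whereas in the wide-format case it is identified by its column, which is why the two JSON entries above carry different fields.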
In order to extend the current concept library both to new datasets and new concepts, further JSON files can be incorporated by adding their paths to RICU_CONFIG_PATH. Concepts with names that already exist are only used for their sources entries, such that hr for new_dataset can be specified as

    "hr": {
      "sources": {
        "new_dataset": [
          {
            "ids": 6640,
            "table": "numericitems",
            "sub_var": "itemid"
          }
        ]
      }
    }

whereas concepts with non-existing names are treated as new concepts.
Callback functions, which can be specified for several item types, are central to providing the flexibility required for loading concepts that need some specific preprocessing. Functions (with appropriate signatures), designated as callback functions, are invoked on individual data items before concept-related preprocessing is applied. A common scenario for this is unit of measurement conversion: in MIMIC-III data, for example, several itemid values correspond to temperature measurements, some of which refer to temperatures measured in degrees Celsius, whereas others are used for measurements in degrees Fahrenheit. As the information encoding which measurement corresponds to which itemid values is no longer available during concept-related preprocessing, this is best resolved at the level of individual data items. Several function factories are available for generating callback functions, and convert_unit() is intended for covering unit conversions. Data items corresponding to the temp concept for MIMIC-III are specified as

    {
      "mimic_demo": [
        {
          "ids": [676, 677, 223762],
          "table": "chartevents",
          "sub_var": "itemid"
        },
        {
          "ids": [678, 679, 223761, 224027],
          "table": "chartevents",
          "sub_var": "itemid",
          "callback": "convert_unit(fahr_to_cels, 'C', 'f')"
        }
      ]
    }

indicating that for ID values 676, 677 and 223762 no preprocessing is required, and for the remaining ID values the function fahr_to_cels() is applied to entries of the val_var column where the regular expression "f" matches the unit_var column (the values of which are ultimately replaced with "C").
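The idea of a function factory that produces a unit-converting callback can be sketched in base R as follows. This is an illustrative sketch and not ricu's convert_unit(): the factory name make_unit_converter(), the column names value and unit, and the case-insensitive matching are assumptions.

```r
# Hypothetical factory: returns a callback that applies `fun` to all
# values whose unit matches `pattern` and then relabels the unit.
make_unit_converter <- function(fun, new_unit, pattern) {
  function(x, val_var = "value", unit_var = "unit") {
    # Case-insensitive matching is an assumption of this sketch
    hits <- grepl(pattern, x[[unit_var]], ignore.case = TRUE)
    x[[val_var]][hits]  <- fun(x[[val_var]][hits])
    x[[unit_var]][hits] <- new_unit
    x
  }
}

# Standard Fahrenheit to Celsius conversion
fahr_to_cels <- function(x) (x - 32) * 5 / 9

to_celsius <- make_unit_converter(fahr_to_cels, "C", "f")

# Toy temperature item with mixed units (made-up values)
temp <- data.frame(
  value = c(98.6, 37.2, 212),
  unit  = c("Deg. F", "C", "F"),
  stringsAsFactors = FALSE
)

res <- to_celsius(temp)
```

Because the conversion is attached to the item rather than the concept, it only touches rows that came from Fahrenheit-encoded itemid values, which is the reason given above for resolving such issues at item level.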

Data classes
In order to represent tabular ICU data, ricu provides several classes, all inheriting from data.table. The most basic of these, id_tbl, marks one (or several) columns as id_vars, which serve to define a grouping (i.e. identify patients or unit stays). Inheriting from id_tbl, ts_tbl is capable of representing grouped time-series data. In addition to id_var column(s), a single column is marked as index_var and is required to hold a base R difftime vector. Furthermore, ts_tbl contains a scalar-valued difftime object as interval attribute, specifying the time-series step size 10. Meta-data is transiently added to data.table objects by classes inheriting from id_tbl and, for S3 generic functions which allow for object modifications, down-casting is implicit. For a ts_tbl object dat with a time-series step size of 1 hour, for example, an internal inconsistency is encountered when shifting time stamps by 30 minutes, as time-steps are no longer multiples of the time-series interval, in turn causing down-casting to id_tbl. If column a were to be removed, direct down-casting to data.table would be required in order to resolve inconsistencies 11.
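The consistency condition behind this down-casting, namely that all time stamps must be multiples of the time-series interval, can be checked with a few lines of base R. The helper is_on_grid() is a hypothetical name used for this sketch and is not part of ricu.

```r
# Returns TRUE if every time stamp is a whole multiple of the interval.
is_on_grid <- function(times, interval) {
  stopifnot(inherits(times, "difftime"), inherits(interval, "difftime"))
  t_sec <- as.numeric(times, units = "secs")
  i_sec <- as.numeric(interval, units = "secs")
  all(t_sec %% i_sec == 0)
}

interval <- as.difftime(1, units = "hours")
times    <- as.difftime(c(0, 1, 2, 5), units = "hours")

# Hourly time stamps are consistent with a 1 hour interval
on_grid_before <- is_on_grid(times, interval)

# Shifting by 30 minutes breaks the multiple-of-interval property
shifted       <- times + as.difftime(30, units = "mins")
on_grid_after <- is_on_grid(shifted, interval)
```

A class that enforces this invariant can either reject the shifted data or, as described above, fall back to a parent class that does not carry the interval requirement.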
Utilizing the attached meta-data, several utility functions can be called with concise semantics. This includes functions for sorting, checking for duplicates, aggregating data per combination of id_vars (and time-step), checking time series data for gaps, verifying whether the time-series is regular and converting between irregular and regular time-series, as well as functions for several types of moving window operations. Adding to those class-specific implementations, id_tbl objects inherit from data.table (and therefore from data.frame), ensuring compatibility with a wide range of functionality targeted at these base-classes.
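Aggregation per combination of ID and time step, with a type-dependent default, can be sketched in base R as shown below. The defaults mirror those described for concepts (first value for character, median for numeric, any-like behavior for logical data), but the helpers default_agg() and aggregate_by_key() are hypothetical and do not reflect ricu internals.

```r
# Choose an aggregation function based on the type of the value column.
default_agg <- function(x) {
  if (is.character(x)) {
    function(v) v[1L]
  } else if (is.logical(x)) {
    # count of TRUE values converted back to logical, i.e. any()
    function(v) sum(v, na.rm = TRUE) > 0
  } else {
    function(v) stats::median(v, na.rm = TRUE)
  }
}

# Aggregate one value column so that rows are unique per key combination.
aggregate_by_key <- function(dat, keys, val) {
  fun <- default_agg(dat[[val]])
  res <- stats::aggregate(dat[val], by = dat[keys], FUN = fun)
  res[do.call(order, unname(as.list(res[keys]))), , drop = FALSE]
}

# Toy data with two measurements for patient 1 at time step 0
dat <- data.frame(
  id   = c(1, 1, 1, 2),
  time = c(0, 0, 1, 0),
  hr   = c(80, 90, 100, 60)
)

agg <- aggregate_by_key(dat, c("id", "time"), "hr")
```

After aggregation each (id, time) pair occurs exactly once, which is the property that makes subsequent merging of several variables well-defined.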

Clinical concepts
11 Updating an object inheriting from id_tbl using data.table::set() bypasses consistency checks, as this is not an S3 generic function and therefore its behavior cannot be tailored to requirements of id_tbl objects. It therefore is up to the user to avoid vitiating id_tbl objects in such a way.
The current selection of clinical concepts that is included with ricu covers many physiological variables that are available throughout the included datasets.Treatment-related information on the other hand, being more heterogeneous in nature and therefore harder to harmonize across datasets, has been added on an as-needed basis and therefore is more limited in breadth.
Available concepts can be enumerated using load_dictionary() and the utility function explain_dictionary() can be used to display some concept meta-data.
R> dict <- load_dictionary(c("mimic_demo", "eicu_demo"))
R> head(dict)

The following sub-sections serve to introduce some of the included concepts as well as highlight limitations that come with current implementations. Grouping the available concepts by category yields the following counts:

R> table(vapply(dict, `[[`, character(1L), "category"))

Physiological data
The largest and most well established group of concepts (covering more than half of all currently included concepts) includes physiological patient measurements such as routine vital signs, respiratory variables, fluid discharge amounts, as well as many kinds of laboratory tests including blood gas measurements, chemical analysis of body fluids and hematology assays.
R> load_concepts(c("alb", "glu"), "mimic_demo", interval = mins(15L))

Most concepts of this kind are represented by num_cncpt objects with an associated unit of measurement and a range of permissible values. Data is mainly returned as ts_tbl objects, representing time-dependent observations. Apart from conversion to a common unit (possibly using the convert_unit() callback function), little has to be done in terms of preprocessing: values are simply reported at time-points rounded to the requested interval.
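The two preprocessing steps mentioned for numeric concepts, filtering by a plausible range and aligning time stamps to the requested interval, can be sketched in base R as follows. The helper names, the range bounds and the choice of flooring (rather than some other rounding rule) are assumptions of this sketch and are not taken from ricu.

```r
# Drop measurements outside [min, max]; missing values are dropped as well.
filter_range <- function(dat, val_var, min, max) {
  val  <- dat[[val_var]]
  keep <- !is.na(val) & val >= min & val <= max
  dat[keep, , drop = FALSE]
}

# Align difftime stamps to a grid by flooring (an assumption of this sketch).
floor_to_interval <- function(times, interval) {
  step  <- as.numeric(interval, units = "mins")
  t_min <- as.numeric(times, units = "mins")
  as.difftime(floor(t_min / step) * step, units = "mins")
}

# Toy glucose measurements (made-up values, hypothetical bounds 0 to 1000)
glu <- data.frame(
  id   = c(1, 1, 1, 2),
  time = as.difftime(c(7, 22, 31, 44), units = "mins"),
  glu  = c(95, 5000, 110, NA)
)

glu      <- filter_range(glu, "glu", min = 0, max = 1000)
glu$time <- floor_to_interval(glu$time, as.difftime(15, units = "mins"))
```

The implausible value and the missing value are removed, and the remaining time stamps fall onto a 15 minute grid.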

Patient demographics
Moving on from dynamic, time-varying patient data, this group of concepts focuses on static patient information. While the assumption of remaining constant throughout a stay is likely to hold for variables such as patient sex or height, this is only approximately true for others, including age or weight. Nevertheless, such effects are ignored and concepts of this group will be mainly returned as id_tbl objects with no corresponding time-stamps included.
Whenever requesting concepts which are returned with associated time-stamps (e.g. glucose) alongside time-constant data (e.g. age), merging will duplicate static data over all time-points.
R> load_concepts(c("age", "glu"), "mimic_demo", verbose = FALSE)

Despite a best-effort approach, data availability can be a limiting factor. While for physiological variables there is good agreement even across continents, data-privacy considerations, as well as lack of a common standard for data encoding, may cause issues that are hard to resolve. In some cases, this can be somewhat mitigated, while in others, this is a limitation to be kept in mind. In AmsterdamUMCdb, for example, patient age, height and weight are not available as continuous variables, but as factors with patients binned into groups. Such variables are then approximated by returning the respective mid-points of groups for aumc data 12. Other concepts, such as adm (categorizing admission types) or a prospective icd concept (diagnoses as ICD-9 codes), can only return data if available from the data source in question. Unfortunately, neither aumc nor hirid contain ICD-9 encoded diagnoses, and in the case of hirid, no diagnosis information is available at all.
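Approximating a binned variable by group mid-points can be sketched in base R as follows. The bin labels below are hypothetical and the treatment of open-ended groups (returned as NA here) is an assumption; the sketch does not reproduce how ricu handles aumc data.

```r
# Map labels such as "40-49" to the mid-point of the two bounds.
# Labels without exactly two numbers (e.g. open-ended groups) yield NA.
bin_midpoint <- function(labels) {
  bounds <- regmatches(labels, gregexpr("[0-9]+", labels))
  vapply(bounds, function(b) {
    b <- as.numeric(b)
    if (length(b) == 2L) mean(b) else NA_real_
  }, numeric(1L))
}

# Hypothetical age groups
age_group <- c("18-39", "40-49", "80+", "60-69")
age_mid   <- bin_midpoint(age_group)
```

Any analysis that treats such mid-points as continuous values inherits the coarseness of the original grouping, which is the limitation noted above.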

Treatment-related information
The largest group of concepts dealing with treatment-related information is described by the medications category. In addition to drug administrations, only basic ventilation information is currently provided as ready-to-use concepts. Like coverage of common ICU procedures, coverage of patient medication is limited, comprising mainly vasopressor administrations, as well as corticosteroids and antibiotics. The current concepts retrieving treatment-related information are mostly focused on providing data required for constructing clinical scores described in Section 4.3.4.
Ventilation is represented by several concepts: a ventilation indicator variable (vent_ind), as well as ventilation durations (vent_dur), are constructed from start and end events (vent_start and vent_end). This includes any kind of mechanical ventilation (invasive via an endotracheal or tracheostomy tube), as well as non-invasive ventilation via face or nasal masks. In line with other concepts belonging to this group, the current state is far from being comprehensive and expansion to further ventilation parameters is desirable.
The singular concept addressing antibiotics (abx) returns an indicator signaling whenever an antibiotic was administered. This includes any route of administration (intravenous, oral, topical, etc.) and reports neither dosage nor active ingredient. Finally, vasopressor administration is reported by several concepts representing different vasoactive drugs (including dopamine, dobutamine, epinephrine, norepinephrine and vasopressin), as well as different administration aspects such as rate and duration (and, for use in SOFA scoring, rate administered for at least 60 minutes).
R> load_concepts(c("abx", "vent_ind", "norepi_rate", "norepi_dur"),
+    "mimic_demo", verbose = FALSE)

As cautioned in Section 4.3.2, variability in data reporting across datasets can lead to issues: the prescriptions table included with MIMIC-III, for example, reports time-stamps as dates only, yielding a discrepancy of up to 24 hours when merged with data where time-accuracy is on the order of minutes. This effect is somewhat mitigated by shifting time-stamps from midnight to mid-day, but the underlying accuracy issue of course remains. Another problem exists with concepts that attempt to report administration windows, as some datasets do not describe infusions with clear-cut start and end points but rather report infusion parameters at (somewhat) regular time intervals. This can cause artifacts when the requested time step-size deviates from the dataset-inherent time grid.
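The effect of shifting date-only time stamps from midnight to mid-day can be illustrated in base R. The helper date_to_midday() is a hypothetical name and the example dates are made up; the sketch only demonstrates why the worst-case discrepancy shrinks from 24 to 12 hours.

```r
# Convert a Date to a POSIXct time stamp at 12:00 of that day.
date_to_midday <- function(d, tz = "UTC") {
  stopifnot(inherits(d, "Date"))
  as.POSIXct(format(d), tz = tz) + as.difftime(12, units = "hours")
}

presc_date <- as.Date("2100-01-01")
midnight   <- as.POSIXct(format(presc_date), tz = "UTC")
midday     <- date_to_midday(presc_date)

# Suppose the drug was actually given late in the evening of that day
true_time <- as.POSIXct("2100-01-01 23:30:00", tz = "UTC")

err_midnight <- abs(as.numeric(difftime(true_time, midnight, units = "hours")))
err_midday   <- abs(as.numeric(difftime(true_time, midday, units = "hours")))
```

With the mid-day convention the error is bounded by 12 hours in either direction, although, as stated above, the underlying loss of precision cannot be recovered.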

Outcomes
A group of more loosely associated concepts can be used to describe patient state.This includes common clinical endpoints, such as death or length of ICU stay, as well as scoring systems such as SOFA, the systemic inflammatory response syndrome (SIRS; Bone, Sibbald, and Sprung 1992) criterion, the National Early Warning Score (NEWS; Jones 2012) and the Modified Early Warning Score (MEWS; Subbe, Kruger, Rutherford, and Gemmel 2001).
While the more straightforward outcomes can be retrieved directly from data, clinical scores often incorporate multiple variables, based upon which a numeric score is constructed.This can typically be achieved by using concepts of type rec_cncpt, specifying the needed components and supplying a callback function that applies rules for score construction.
R> load_concepts(c("sirs", "death"), "mimic_demo", verbose = FALSE,
+    keep_components = TRUE)

Callback functions can become rather involved (especially for more complex concepts such as SOFA) and may include arbitrary arguments to tune their behavior. As callback functions to rec_cncpt objects are typically called internally from load_concepts(), arguments not used by load_concepts(), such as keep_components in the above example (causing not only the score column, but also individual score components to be retained), are forwarded 13.
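How a score-constructing callback with a keep_components switch might look is sketched below in base R. This is a simplified illustration and not ricu's SIRS implementation: the function name sirs_score(), the column names and the toy data are assumptions, the thresholds follow the commonly cited SIRS criteria, and details such as band counts, time windows and missing data are ignored.

```r
# Simplified SIRS-style score: one point per fulfilled criterion.
# Assumed units: temp in degrees Celsius, hr in beats per minute, resp in
# breaths per minute, pco2 in mmHg, wbc in 10^9 cells per litre.
sirs_score <- function(dat, keep_components = FALSE) {
  comp <- data.frame(
    temp_comp = as.integer(dat$temp > 38 | dat$temp < 36),
    hr_comp   = as.integer(dat$hr > 90),
    resp_comp = as.integer(dat$resp > 20 | dat$pco2 < 32),
    wbc_comp  = as.integer(dat$wbc > 12 | dat$wbc < 4)
  )
  res <- data.frame(id = dat$id, sirs = rowSums(comp))
  # Optionally retain the individual components next to the total
  if (keep_components) res <- cbind(res, comp)
  res
}

# Toy input with complete data for two patients (made-up values)
vitals <- data.frame(
  id   = c(1, 2),
  temp = c(38.5, 37.0),
  hr   = c(100, 70),
  resp = c(18, 16),
  pco2 = c(40, 38),
  wbc  = c(8, 7)
)

score_only <- sirs_score(vitals)
with_comps <- sirs_score(vitals, keep_components = TRUE)
```

Passing keep_components = TRUE changes only which columns are returned, not the score itself, which makes such an argument a natural candidate for being forwarded from a higher-level loading function.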

Examples
In order to briefly illustrate how ricu could be applied to real-world clinical questions, two toy examples are provided in the following sections. While the first example fully relies on data concepts that are included with ricu, the second one explores both how some data preprocessing can be added to an existing concept by creating a new rec_cncpt and how to create a new data concept altogether.

Lactate and mortality
First, we investigate the association of lactate levels and mortality. This problem has been studied before and it is widely accepted that both static and dynamic lactate indices are associated with increased mortality (Haas, Lange, Saugel, Petzoldt, Fuhrmann, Metschke, and Kluge 2016; Nichol, Bailey, Egi, Pettila, French, Stachowski, Reade, Cooper, and Bellomo 2011; Van Beest, Brander, Jansen, Rommes, Kuiper, and Spronk 2013). In order to model this relationship, we fit a time-varying proportional hazards Cox model (Therneau and Grambsch 2000; Therneau and Lumley 2015), which includes the SOFA score as a general predictor of illness severity, using MIMIC-III data. Furthermore, for the sake of this example, we are only interested in patients admitted from 2008 onwards, aged 25 to 65 years.
R> src <- "mimic"
R>
R> cohort <- load_id("icustays", src, dbsource == "metavision",
+    cols = NULL)
R> cohort <- load_concepts("age", src, patient_ids = cohort,
+    verbose = FALSE)
R>
R> dat <- load_concepts(c("lact", "death", "sofa", "sex"), src,

After loading the data, some minor preprocessing is still required before modeling: first, we want to make sure we only use data up to (and including) the hour in which the death flag switches to TRUE. After that, we impute missing values for lact using a last observation carry forward (locf) scheme (observing the patient grouping) and we simply replace missing death values with the value FALSE.

The resulting model fit can be visualized as:

A simple exploration already shows that increased values of lactate are associated with mortality, even after adjusting for the SOFA score.
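The imputation steps described above (last observation carried forward within each patient, and replacing missing death flags with FALSE) can be sketched in base R. The helpers locf() and locf_by_group() as well as the toy vectors are assumptions made for illustration; they are not the code used in the original analysis.

```r
# Last observation carried forward; leading NAs stay NA.
locf <- function(x) {
  idx <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[idx + 1L]
}

# Apply locf() separately within each group (rows must be time-ordered).
locf_by_group <- function(x, group) {
  stats::ave(x, group, FUN = locf)
}

# Toy data: two patients with four hourly time steps each (made-up values)
id    <- c(1, 1, 1, 1, 2, 2, 2, 2)
lact  <- c(NA, 2.1, NA, NA, 1.5, NA, 3.0, NA)
death <- c(NA, NA, NA, TRUE, NA, NA, NA, NA)

lact_filled <- locf_by_group(lact, id)

# Missing death flags are taken to mean "not (yet) dead"
death[is.na(death)] <- FALSE
```

Grouping matters here: without it, the last lactate value of patient 1 would leak into the first time steps of patient 2 whenever those are missing.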

Diabetes and insulin treatment
For the next example, again using MIMIC-III data, we turn to the usage of co-morbidities and treatment-related information. We look at the amount of insulin administered to patients in the first 24 hours from their ICU admission. In particular, we investigate if diabetic patients receive more insulin in the first day of their stay compared to non-diabetic patients. For this we create two concepts: ins24, a binned variable representing the cumulative amount of insulin administered within the first 24 hours of an ICU admission, and diab, a logical variable encoding diabetes co-morbidity.
As there already is an insulin concept available, ins24 can be implemented as rec_cncpt, loading ins with aggregation set to sum() (instead of median()) and inserting the callback function ins_cb() into the loading process. The callback function takes care of the preprocessing steps outlined above: first, data is subsetted to fall into the first 24 hours of ICU admissions, followed by binning of summed values.
]
+
+   ins
+ }
R>
R> ins24 <- load_dictionary(src, "ins")
R> ins24 <- concept("ins24", ins24, "insulin in first 24h", aggregate = "sum",
+    callback = ins_cb, target = "id_tbl", class = "rec_cncpt")

The binary diabetes concept can be implemented as lgl_cncpt, for which ICD-9 codes are matched using a regular expression. As we are not only interested in retrieving diabetic patients, a col_itm is better suited for data retrieval than an rgx_itm, and for creating the required callback function that produces a logical vector we can use transform_fun() coupled with a function like grep_diab(). The two concepts are then combined using c() and loaded via load_concepts().
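The two preprocessing ideas used in this example, binning a cumulative dose and flagging diabetes from ICD-9 codes via a regular expression, can be sketched in base R. The bin edges, the helper names and the toy data are hypothetical and not taken from the original code; the sketch assumes ICD-9 codes are stored without a decimal point and relies on diabetes mellitus codes starting with 250.

```r
# Bin cumulative insulin amounts; the break points are arbitrary examples.
bin_insulin <- function(total, breaks = c(0, 1, 10, 20, 40, Inf)) {
  cut(total, breaks = breaks, right = FALSE)
}

# ICD-9 diabetes mellitus codes start with 250 (e.g. "25000").
is_diabetes_icd9 <- function(codes) {
  grepl("^250", codes)
}

# Toy cumulative insulin per patient (made-up values)
ins_total <- c(0, 5, 50)
ins_bin   <- bin_insulin(ins_total)

# Toy diagnosis table in long format: several codes per patient
diag <- data.frame(
  id   = c(1, 1, 2, 3),
  code = c("25000", "4019", "4280", "25013"),
  stringsAsFactors = FALSE
)

# A patient counts as diabetic if any of their codes matches
diab <- tapply(is_diabetes_icd9(diag$code), diag$id, any)
```

Evaluating the pattern on every row and reducing per patient keeps non-diabetic patients in the result with a FALSE value, which matches the motivation given above for preferring column-based retrieval over regular-expression-based row selection.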

Table 1: Comparison of datasets supported by ricu, highlighting some of the major similarities and distinguishing features among the four data sources described in the preceding paragraphs. Values followed by parenthesized ranges represent medians and are accompanied by quartiles.
* These values represent the number of atomic concepts per data source.Additionally, 29 recursive concepts are available, which build on source-specific atomic concepts in a source-agnostic manner (see Section 4.1 for details).
table, which returns a src_tbl object representing the requested table. As is the case for src_env objects, src_tbl objects inherit from additional classes such that certain per-dataset behavior can be customized. The admissions table of the MIMIC-III demo dataset, for example, inherits from mimic_demo_tbl and mimic_tbl (alongside classes src_tbl and prt). This syntax makes it possible to read row-subsets of long tables into memory with little memory overhead. While terseness of such an API does introduce potential ambiguity, this is mostly overcome by using the tidy eval framework provided by rlang (Henry and Wickham 2020). Powered by the prt (Bennett 2021) package, src_tbl objects represent row-partitioned tabular data stored as multiple binary files created by the fst (Klik 2020) package. In addition to standard subsetting, prt objects can be subsetted via the base R S3 generic function subset() and using non-standard evaluation:
R> subset(mimic_demo$admissions, subject_id > 44000, language:ethnicity)