Privacy-protecting, reliable response data discovery using COVID-19 patient observations

Abstract Objective To utilize, in an individual and institutional privacy-preserving manner, electronic health record (EHR) data from 202 hospitals by analyzing answers to COVID-19-related questions and posting these answers online. Materials and Methods We developed a distributed, federated network of 12 health systems that harmonized their EHRs and submitted aggregate answers to consortium questions posted at https://www.covid19questions.org. Our consortium developed processes and implemented distributed algorithms to produce answers to a variety of questions. We were able to generate counts and descriptive statistics, and to build a multivariate, iterative regression model, without centralizing individual-level data. Results Our public website contains answers to various clinical questions, a web form for users to ask questions in natural language, and a list of items currently pending responses. The results show, for example, that patients who were taking angiotensin-converting enzyme inhibitors or angiotensin II receptor blockers within the year before admission had lower unadjusted in-hospital mortality rates. We also showed that, when adjusted for age, sex and ethnicity were not significantly associated with mortality. We demonstrated that it is possible to answer questions about COVID-19 using EHR data from systems that have different policies and must follow various regulations, without moving data out of those health systems. Discussion and Conclusions We present an alternative or a complement to centralized COVID-19 registries of EHR data. We can use multivariate distributed logistic regression on observations recorded in the process of care to generate results without transferring individual-level data outside the health systems.


INTRODUCTION
The COVID-19 pandemic represents a watershed event in public health and has highlighted numerous opportunities and needs in clinical and public health informatics infrastructure. [1][2][3] One of the key challenges is the rapid analysis and interpretation of observational data to inform clinical decision-making and patient expectations, understanding, and perceptions. [4][5][6][7][8] Several initiatives are building COVID-19 registries or consortia to analyze electronic health record (EHR) data. 7 The expectation is that these resources will provide researchers and clinicians access to a rich source of observational data to understand the clinical progression of COVID-19, to estimate the impact of therapies, and to make predictions regarding outcomes. Registries may contain limited data for patients diagnosed with COVID-19: the barriers to having more data are based both on privacy concerns and on what elements have been deemed valuable by health professionals and researchers at a particular point in time. The problem with a new and evolving disease like COVID-19 is that we do not know what data or information will be most valuable. For example, in the pandemic's early stages, the dermatological and hematological findings were not evident, and those data were not included in registries or reports. 9 Interest in specific laboratory markers (eg, D-dimer, troponin) for these disturbances and in additional medications (eg, antihypertensive drugs) or phenotypes (eg, diabetes, blood type) has increased over time. [10][11][12] Additionally, it is challenging for researchers and clinicians to understand the structure and quality of the data in data repositories, and to formulate queries to consult the data in their institution and in others.
Thus, the utilization of EHRs to characterize COVID-19 disease progression and outcomes is challenging. However, EHR data may be useful when a randomized clinical trial cannot be conducted. Observational data may also help determine if results from a randomized clinical trial replicate after relaxing eligibility criteria for real-world applications. While the scientific community has raised concerns about the reproducibility of findings, data provenance, and proper utilization of observational data, resulting in some COVID-19 articles being retracted, 13 there remains a clear need to responsibly, ethically, and transparently analyze observational data to provide hypothesis generation and guidance in the pursuit of evidence-based healthcare.
In this study, we focus on using novel decentralized data governance and methods to analyze EHR-derived data.

MATERIALS AND METHODS
Researchers' questions posed in natural language are answered by distributed data maintained in 12 health systems, covering 202 hospitals located in all US states and two territories plus one international academic medical center (Table 1). This collaboration provides the capability for comparisons with historical data from over 45 million patients and uses a dynamic approach to account for an evolving awareness of the most impactful COVID-19 questions to answer and hypotheses to explore. All sites have transformed or are actively transforming data into the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), but some of them only use data from COVID-19 registry patients (ie, they do not transform the full EHR-based data warehouse), and others only have the items required by the query in OMOP. The ability to build and evaluate multivariate models across a large number of health systems and to integrate results from registries differentiates our approach from most federated clinical data research network approaches.
The development of our Q&A system involved the inclusion of new concept codes in local repositories and agreement on concept definitions (eg, what constitutes a COVID-19 hospitalization, what codes should be included in the definition of History of Coronary Heart Disease, and how to map laboratory test records into LOINC, for which we developed a mapping tool). 14 The translation of questions into code relies on members of the Reliable Response Data Discovery for COVID-19 (R2D2) Consortium. Instead of singular control by a coordinating center, the R2D2 consortium allows participating institutions to "own" the development and testing of queries across various sites, which promotes a balanced division of workload and increases the ability of individual sites to develop generalizable queries and manage responses with help from the whole consortium. The analyses, performed on data transformed into the OMOP CDM from relevant patient cohorts, do not require data transfer outside the participating institutions and reduce the risk of individual or institutional privacy breaches. After a partially automated quality control process, which is carefully reviewed by multiple consortium members, only the results of calculations (eg, counts, statistics, coefficients, variance-covariance matrices) are released from the healthcare institutions; no individual patient-level data are shared. 15

Workflow
Figure 1 shows our general workflow, including human interpretation and clarification of questions and human quality control of answers, using graphs and related visualizations as much as possible. The responsibility of the Lead Site (to create a template query for all responding sites to use for rapid response) rotates among institutions (ie, health systems). A more detailed workflow is illustrated in Figure 2 using a swim-lane format with an emphasis on roles. The Q&A process starts when a user creates a request through the public website, https://covid19questions.org.
Next, the data scientist at the Consortium Hub verifies whether the question has been answered before and passes it on to the clinician at the Consortium Hub, who assesses its feasibility (ie, whether the received question is answerable from the local data mart) and then assigns it to one of the 12 institutions as a Lead Site. Throughout the whole process, the tracking system is used to report an issue to assignees, respond to the issue, update the code and results, and prompt a rerun of the updated structured query language (SQL) code. Next, a clinician at the Lead Site works with the local database analyst to review and develop a concept set. This is an iterative process within the Lead Site: develop a concept set, create SQL, generate results, and evaluate the results against the EHR records, including chart reviews. The outputs of the Lead Site-level process are a template query (.sql format) and a template output (.csv format), which are uploaded to the shared code repository.
Once the 11 Responding Sites receive notification e-mails about the template query and the format for the results, their database analysts run the template SQL to get preliminary results and review these against their EHR data with clinicians. This part of the process is where a Responding Site most frequently runs into errors and challenges and requires troubleshooting. For example, when missing concepts, like D-dimer or blood type (illustrated in Figure 3), are discovered, the database analyst at the Responding Site creates an issue in the tracking system and resolves it with the database analyst and the clinician at the Lead Site. Since there are 11 Responding Sites, the Lead Site coordinates concept set and SQL development through one-on-one sessions with each Responding Site. Through this iterative process among the 12 sites, the concept set and SQL are continuously updated, improving their sensitivity and specificity in identifying the right patients and hospitalization encounter records. This involves rewriting and updating existing extract-transform-load (ETL) scripts that map source EHR data to the target common data model (CDM, which in our case is OMOP). 16 Institutions with the same EHR system or database management system share common experience and knowledge to help each other develop ETL scripts and evaluate the OMOP query results against EHRs.
When all Responding Sites have uploaded their site-level results, the data scientist at the Consortium Hub merges these results into a single file. A generic and extensible format for site-level summary results is used to answer general epidemiology and clinical research questions (Figure 4). Then a data quality check is conducted. While use of a CDM in a large clinical data research network is a widely used approach to enable interoperable query development, a query formulated in one institution may not return accurate results in another due to variations in data integration and differences in data quality. Several rounds of confirmations and checks with data analysts and clinical informaticians at each institution are often necessary to answer questions with confidence. There are many potential sources of errors, and Table 2 displays selected examples of data quality checks. The check types are based on the PEDSnet framework 17 and revised to fit our project's specific needs. The data scientist resolves issues together with the Lead Site and the Responding Site. When the aggregate results pass the quality control test, the Consortium Hub clinician conducts the final review to ensure their clinical relevance. During several rounds of code releases and responses between the Lead Site and the Responding Sites, database developers rewrite their ETL scripts to increase the accuracy of the query results. Finally, if the clinician approves the release of the result, the data scientist uploads the answer to the public website (https://covid19questions.org) and notifies the requestor via e-mail, which completes the workflow. Quality improvement-related steps and data visualization are either semiautomated or manually conducted. ETL refresh, the initial data quality check, and data aggregation are automated with scheduling scripts.
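The merge-and-check step above can be sketched as follows. The CSV layout (group,count columns), the check names, and the threshold of 10 are illustrative assumptions for this sketch, not the consortium's actual formats.

```python
import csv
import io

MIN_CELL = 10  # illustrative small-cell threshold; real policy may differ

def merge_site_results(site_csvs):
    """Merge per-site aggregate CSVs (hypothetical columns: group,count)
    into consortium-level totals, flagging rows that fail basic checks."""
    totals, issues = {}, []
    for site, text in site_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            n = int(row["count"])
            if n < 0:
                issues.append((site, row["group"], "negative count"))
            elif n == 0:
                issues.append((site, row["group"], "unexpected zero"))
            elif n < MIN_CELL:
                issues.append((site, row["group"], "small cell"))
            totals[row["group"]] = totals.get(row["group"], 0) + n
    return totals, issues
```

Flagged rows would be routed back to the Lead and Responding Sites through the tracking system rather than silently aggregated.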

Federated regression
In addition to count queries, we also applied Grid Binary LOgistic REgression (GLORE) 15 to compute the effect of the exposure variable on the outcome, adjusted for confounders, without sharing patient-level data, as this would increase the risk of a privacy breach. We rewrote the Newton-Raphson method to find the maximizer of the likelihood function of the logistic regression parameters for horizontally partitioned datasets. Since the first and second derivatives of the log-likelihood function are separable (ie, they can be partially calculated at each site), in each Newton-Raphson iteration each client institution calculated a (p + 1)-dimensional parameter vector, where p is the number of features in the model (such as age, sex, and race), and a (p + 1) by (p + 1) variance-covariance matrix; JSON files containing these two objects were then sent to the Consortium Hub. At each iteration, the Consortium Hub automatically updates the global coefficient vector and the variance-covariance matrix and sends them back to the clients.
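A minimal sketch of this federated Newton-Raphson scheme is shown below. The function and variable names are ours, the JSON transport between sites and hub is omitted, and each "site" is represented simply as a local (X, y) pair with X including an intercept column.

```python
import numpy as np

def local_stats(X, y, beta):
    """One site's share of the separable first and second derivatives."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # predicted probabilities
    grad = X.T @ (y - mu)                    # (p+1)-dimensional gradient share
    W = mu * (1.0 - mu)
    hess = X.T @ (X * W[:, None])            # (p+1) x (p+1) information share
    return grad, hess

def glore_fit(sites, n_features, iters=25, tol=1e-8):
    """Hub-side loop: aggregate per-site summaries, update global beta."""
    beta = np.zeros(n_features + 1)          # includes intercept
    for _ in range(iters):
        grads, hesses = zip(*(local_stats(X, y, beta) for X, y in sites))
        grad = np.sum(grads, axis=0)         # hub sees only these summaries
        hess = np.sum(hesses, axis=0)
        step = np.linalg.solve(hess, grad)   # Newton-Raphson update
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, np.linalg.inv(hess)         # coefficients, var-cov matrix
```

Because the summed gradients and Hessians equal those computed on the pooled data, the federated estimates coincide with a centralized fit, which is the property the consortium relies on.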

RESULTS
Between 12/11/2020 and 8/31/2020, our consortium had 928 255 patients tested for SARS-CoV-2, 59 074 diagnosed with COVID-19, 19 022 hospitalized, and 2591 deceased. Our public questions and answers portal (https://covid19questions.org) provides answers to research questions using several univariate or multivariate analyses, including potential associations between mortality and comorbidities; prehospitalization use of medications; laboratory values; and hospital events. For each question, we report the number of participating institutions and the time period within which local queries were run. Figures 4-6 illustrate the answers.

Example 1. "Many adult COVID-19 patients who were hospitalized did not get admitted to the ICU and were discharged alive. How many returned to the hospital within a week, either to the emergency room or for another hospital stay?" This question is important both from the standpoint of understanding the natural course of disease and for planning needed resources. Although efforts are underway to understand postdischarge outcomes in COVID-19 infected patients, to date they have been limited to case series, 18 modest sample sizes, 19 or single-center or geographically concentrated health systems. 20 These extant studies may also be hampered by fixed inclusion/exclusion criteria. 21

Example 2. "Among adults hospitalized with COVID-19, is in-hospital mortality associated with demographic characteristics (age, ethnicity, sex, and race)?" The answers from univariate analyses indicate that age, ethnicity, and sex are significant. A distributed logistic regression (Figure 6) shows that, among these, only age is significant. There is great interest and a growing peer-reviewed literature on risk factors for COVID-19 mortality; the agility of our approach allows us to quickly rerun queries and rebuild models as new predictors become relevant and the understanding of the disease evolves. 20,22,23

Cohort and concept set
As questions frequently refer to the same subsets of patients, we developed electronic cohort definitions that facilitate our answers. We followed the US Centers for Disease Control and Prevention (CDC) guideline 24 and the National COVID-19 Cohort Collaborative 25 and Observational Health Data Sciences and Informatics (OHDSI) approaches 26 to develop a cohort of hospitalization encounters for COVID-19 as a base for all inpatient questions. Through an iterative process among multiple sites, we developed a canonical SQL whose results match those of the ground truth cohort definition. The intersection of the R2D2 canonical SQL, the private reference (ie, EHR- or registry-based), and the universal reference (ie, a positive polymerase chain reaction test for SARS-CoV-2) was maximized for existing and new sites. Figure 3A displays the electronic phenotyping of adults hospitalized with COVID-19 derived by the canonical SQL and stored procedure SQL scripts. Hospitalization encounters were identified using concepts stored in the OMOP VISIT_OCCURRENCE table; COVID-19 cases required at least one SARS-CoV-2 laboratory measurement record (eg, Concept Id: 706163) and at least one value_as_concept_id for a positive result (eg, Concept Name: Positive, Concept Id: 45884084).

Figure 2. Swimlane diagram. A Q&A process flow starts from a user entering a request and ends with the user receiving e-mail notification about a response. At the Consortium Hub, the data scientist is responsible for aggregating site-level results and for data quality checks. The clinician at the Consortium Hub is responsible for feasibility assessment of the question, triaging to a Lead Site, and for the approval of the aggregate answer. At the Lead Site, the clinician reviews the assigned question text and works with the database analyst to translate the question into SQL and ensure the results are clinically relevant. The database analyst at the Lead Site writes the SQL code, runs it, verifies the results, and releases the code to the Consortium Hub. At the Responding Site, the database analyst runs the Lead Site's SQL code, reviews the results together with local clinicians, and uploads the site-level results to the Consortium Hub through an iterative process of ETL update, local data mapping, and concept set development led by the Lead Site.

Figure 3. Cohort definition and concept set development. Defining a cohort of patients that is frequently used to answer questions helps us reuse code. In this example, defining the cohort of patients hospitalized with COVID-19 involves use of SARS-CoV-2 test results or diagnosis codes (A). In (B), we illustrate how a laboratory test is defined differently at two sites and how blood type had yet to be harmonized into OMOP at one site.
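To make the shape of such a template cohort query concrete, here is a toy version run against an in-memory SQLite database with a minimal OMOP-like schema. The table and column names follow the OMOP CDM, but the rows are fabricated and the visit concept 9201 (the standard OMOP "Inpatient Visit" concept) is used purely for illustration; the consortium's canonical SQL is far more elaborate.

```python
import sqlite3

# Toy template query: inpatient encounters with a positive SARS-CoV-2 test.
COHORT_SQL = """
SELECT DISTINCT v.person_id
FROM visit_occurrence v
JOIN measurement m ON m.person_id = v.person_id
WHERE m.measurement_concept_id = 706163   -- SARS-CoV-2 laboratory test
  AND m.value_as_concept_id = 45884084    -- Positive result
  AND v.visit_concept_id = 9201           -- Inpatient Visit
ORDER BY v.person_id
"""

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE visit_occurrence(person_id INT, visit_concept_id INT);
CREATE TABLE measurement(person_id INT, measurement_concept_id INT,
                         value_as_concept_id INT);
INSERT INTO visit_occurrence VALUES (1, 9201), (2, 9201), (3, 9202);
INSERT INTO measurement VALUES
  (1, 706163, 45884084),  -- positive test + inpatient visit: in cohort
  (2, 706163, 0),         -- non-positive result (placeholder concept)
  (3, 706163, 45884084);  -- positive test, but not an inpatient visit
""")
cohort = [r[0] for r in con.execute(COHORT_SQL)]
```

Sharing a query of this shape, rather than any rows of visit_occurrence or measurement, is what lets each site compute its contribution locally.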
Use cases of the concept set are shown in Figure 3B. A query developed at the Lead Site might miss a concept used at other sites. In such a case, the database analyst at the Responding Site notifies the Lead Site by creating a GitHub issue, triggered by a zero or unexpectedly low count or proportion in the results generated by the initial template query authored by the Lead Site. For example, in Figure 3B, during concept set development for the quantitative laboratory measurement D-dimer, the Responding Site notified the Lead Site that it used another concept for D-dimer (Concept ID: 3048540 instead of Concept ID: 3051714), returning values in a different measurement unit than that of the Lead Site (n/L instead of mg/L). Therefore, the Lead Site had to add the missing concept to the concept set and implement logic to perform a measurement unit transformation. In the case of concept set development for blood type, a Responding Site was missing concepts for blood type in its local OMOP CDM database, and an ETL script was implemented to map the EHR data to the OMOP CDM. Sources of discrepancy were diverse; examples included unit differences in measurement values, differently mapped concepts, and noncompliance with the coding guideline. All SQL code and concept sets for answered questions are publicly available from the GitHub repository: https://github.com/DBMI/R2D2-Public. The public repository is updated whenever a new question and its answer are posted on the public website. The similarities and differences between our approach and other consortia are detailed in the Supplementary Material.

Supplementary Figure 1 shows a screenshot of a real example JSON file used during a GLORE run to answer the in-hospital mortality question. No patient-level information was shared or transferred between institutions. All clients repeatedly sent the updated JSON file to the Consortium Hub until the estimates stabilized or a predefined number of iterations was reached.
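The unit-transformation logic added for the D-dimer example can be sketched as follows. The two Concept IDs come from Figure 3B, but the canonical unit and the conversion factor are illustrative assumptions, not the consortium's actual mapping.

```python
# Hypothetical concept-set entry mirroring the D-dimer case in Figure 3B:
# two source concepts report values in different units, so Responding Site
# values are normalized to the Lead Site's unit before aggregation.
DDIMER_CONCEPTS = {
    3051714: ("mg/L", 1.0),    # Lead Site's concept, canonical unit
    3048540: ("ng/mL", 1e-3),  # Responding Site's concept; factor illustrative
}

def harmonize_ddimer(concept_id, value):
    """Convert a D-dimer value to the canonical unit (mg/L in this sketch)."""
    if concept_id not in DDIMER_CONCEPTS:
        # Surfacing the gap mirrors the GitHub-issue workflow for missing concepts
        raise KeyError(f"Concept {concept_id} missing from the concept set")
    _, factor = DDIMER_CONCEPTS[concept_id]
    return value * factor
```

A missing concept raises immediately rather than silently returning an unconverted value, which is the failure mode the zero/low-count checks are designed to catch.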
To enhance security, the Consortium Hub server allowed (ie, "white-listed") only the preregistered IP addresses of client machines and opened the port only during the scheduled time window.
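The gatekeeping rule amounts to a simple predicate, sketched below; the addresses and the time window are placeholders, not the consortium's private configuration.

```python
from datetime import datetime, timezone

# Placeholder allow-list and scheduled window; real values are kept private.
ALLOWED_IPS = {"192.0.2.10", "192.0.2.22"}   # preregistered client machines
WINDOW_HOURS = range(9, 17)                  # port open 09:00-16:59 hub time

def accept_connection(peer_ip, now=None):
    """Accept only preregistered clients during the scheduled window."""
    now = now or datetime.now(timezone.utc)
    return peer_ip in ALLOWED_IPS and now.hour in WINDOW_HOURS
```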
Several other questions and answers are shown in the portal. A novel governance structure (Figures 1 and 2) allows us to distribute the workload across various teams without relying on a traditional coordinating center, using a Consortium Hub instead. This approach keeps patient data in-house, simplifies data use agreements, avoids delegation of control of patient data to another institution, and allows any institution to benchmark its results against those produced by the consortium, since all questions and respective final aggregated answers, database query code, concept definitions, and analytics code are made public. It complies with HIPAA, 27 the Common Rule, 28 the GDPR, 29 and the California Consumer Privacy Act 30 with regard to the handling of patient data. Code sharing and public answers promote transparency and reproducibility without disclosing patient or institutional information.

DISCUSSION
Our approach is practical and generalizable: The network can be repurposed to any other disease of interest, as it is not based exclusively on data elements deemed relevant for COVID-19. Because privacy protection is at the core of our network, a wide range of institutions can participate. We provide a rapidly deployable and reproducible alternative or complement to centralized registries of EHR data that allows healthcare institutions to stay in control of their data.
This study has advantages but also some limitations. The advantages are that we can, in a relatively short time, publicly post answers using data from a spectrum of institutions with different information technology baselines, levels of expertise in standardized data models and vocabularies, institutional policies, and state and federal regulations. Because we keep data locally and only consult data elements that are necessary to answer specific questions, this approach has a very low risk of privacy breach. However, for this reason, our approach does not provide answers in real time. We made this practical decision to quickly collect aggregate counts and statistics in near real time within existing institutional policies and OMOP implementations, to meet the clinical need of a rapidly spreading pandemic while preserving patient privacy. A real-time query with a fully automated process would be ideal, but this necessitates a long process of interinstitution agreement, amendments to institutional policies, and a complete harmonization of EHR data across all sites. The use of OMOP CDM data is dependent on recurring ETL processes at each site, which presents a challenge to presenting real-time data. Additionally, as opposed to registries that typically focus on a single disease or condition, we have comparator data from other patients.

Table 2 (excerpt). Selected examples of data quality checks:
- The number of patients who were taking any antihypertensives was zero
- Unmatched group sum: the total sums of patient counts in age groups and race groups were different even when all cell counts were greater than 10
- Version mismatch: the version of the template query was revised after the query result was uploaded
Institutional privacy is also preserved because all public answers combine the aggregate data from at least three Responding Sites. Making concept definitions, query code, and results publicly available enhances reproducibility. A major advantage is that existing registries or consortia can serve as additional sites to help answer certain questions. However, limitations are inherent in considering all sites equal when formulating a final answer, as regional or institutional practice variations are not represented in the answers. Additionally, the distributed nature of the consortium adds a requirement for educating some system leaders on distributed analytics. A specific limitation of our current consortium is the preponderance of institutions based in California: 67% of institutions, representing 17.5% of COVID-19 patients. This was a convenience sample of organizations that had a history of collaboration. We are currently adding two new large health systems, one in the Northeast United States and another in the Southeast. To display changes over time and to help users compare our results to public results, new SQL code has been developed. Additionally, the increasing use of automated stored procedures will help reduce manual processing.
We believe that our COVID-19 Clinical Data Consult is a tool for achieving rapid and robust responses to COVID-19 questions submitted by the public or by researchers. We achieve those goals by combining a transparent, privacy-preserving code-sharing workflow with the use of harmonized distributed data. A vision for the future, in which data services converge, would include interoperability with other efforts, including federated multivariate analyses across different consortia (eg, R2D2, 4CE, and N3C).

CONCLUSION
Instead of centralizing data at the Consortium Hub, we focus on interpreting and clarifying the research questions in order to determine the data elements required. Our teams analyze these data elements to generate aggregate statistics at the multiple institutions, documenting the specific version of SQL code executed at a specific time point to generate their answers. In addition to basic counts and proportions, we use distributed multivariate analyses to adjust for confounders and estimate risk-adjusted odds ratios. This is done in a synchronized fashion for iterative federated algorithms, such as the one previously reported for building a logistic regression model. We have shown previously that a model obtained this way is identical to one built using data centralized in a single location. We made SQL code, cohort definitions, and concept sets publicly available at https://github.com/DBMI/R2D2-Public. We invite other institutions, consortia, and registries worldwide to join us at https://covid19questions.org.