Assessing the practice of data quality evaluation in a national clinical data research network through a systematic scoping review in the era of real-world data

Abstract Objective To synthesize data quality (DQ) dimensions and assessment methods of real-world data, especially electronic health records, through a systematic scoping review and to assess the practice of DQ assessment in the national Patient-Centered Clinical Research Network (PCORnet). Materials and Methods We started with 3 widely cited DQ publications (2 reviews, from Chan et al [2010] and Weiskopf et al [2013a], and 1 DQ framework, from Kahn et al [2016]) and expanded our review systematically to cover relevant articles published up to February 2020. We extracted DQ dimensions and assessment methods from these studies, mapped their relationships, and organized a synthesized summary of existing DQ dimensions and assessment methods. We then reviewed the data checks employed by PCORnet and mapped them to the synthesized DQ dimensions and methods. Results We analyzed a total of 3 reviews, 20 DQ frameworks, and 226 DQ studies and extracted 14 DQ dimensions and 10 assessment methods. We found that completeness, concordance, and correctness/accuracy were commonly assessed. Element presence, validity check, and conformance check were commonly used DQ assessment methods and were the main focus of the PCORnet data checks. Discussion Definitions of DQ dimensions and methods were not consistent in the literature, and the DQ assessment practice was not evenly distributed (eg, usability and ease-of-use were rarely discussed). Challenges in DQ assessment exist given the complex and heterogeneous nature of real-world data. Conclusion The practice of DQ assessment is still limited in scope. Future work is warranted to generate understandable, executable, and reusable DQ measures.


INTRODUCTION
There has been a surge of national and international clinical research networks (CRNs) curating immense collections of real-world data (RWD) from diverse sources and of different data types, such as electronic health records (EHRs) and administrative claims, among many others. One prominent CRN example is the national Patient-Centered Clinical Research Network (PCORnet), 1,2 funded by the Patient-Centered Outcomes Research Institute (PCORI), which contains data on more than 66 million patients across the United States (US). 3 The OneFlorida Clinical Research Consortium, 4 first created in 2009, is 1 of the 9 CRNs contributing to the national PCORnet. The OneFlorida network currently includes 12 healthcare organizations that provide care for more than 60% of Floridians through 4100 physicians, 914 clinical practices, and 22 hospitals covering all 67 Florida counties. 5 The centerpiece of the OneFlorida network is its Data Trust, a centralized data repository that contains longitudinal and robust patient-level records of approximately 15 million Floridians from various sources, including the Medicaid and Medicare programs, cancer registries, vital statistics, and EHR systems from its clinical partners. Both the amount and the types of data collected by OneFlorida are staggering.
Arising from the US Food and Drug Administration (FDA) Real-world Evidence (RWE) program, RWD such as those in OneFlorida are increasingly important to support a wide range of healthcare and regulatory decisions. 6,7 RWD are also playing an increasingly critical role in various other national initiatives, such as learning health systems, 8,9 comparative effectiveness research, 10 and pragmatic clinical trials. 11 Nevertheless, concerns over the quality of RWD persist: data quality (DQ) issues, such as incompleteness, inconsistency, and inaccuracy, are widely reported and discussed. 12,13 To maximize the utility of RWD, data quality should be systematically assessed and understood.
The literature on DQ assessment is rich, with a number of DQ frameworks developed over time. Wang et al (1996) 14 proposed a conceptual framework for assessing DQ aspects that are important to data consumers. McGilvray (2008) 15 described 10 steps to quality data, where DQ assessment is an important step. Chan et al (2010) 16 conducted a literature review on EHR DQ and summarized 3 DQ aspects: accuracy, completeness, and comparability. Nahm (2012) 17 defined 10 DQ dimensions (eg, accuracy, currency, completeness) specific to clinical research with a framework for DQ practice. Kahn et al (2012) 18 proposed the "fit-for-use by data consumers" concept with a process model for multisite DQ assessment. Weiskopf et al (2013a) 19 provided an updated literature review on EHR DQ and identified 5 DQ dimensions: completeness, correctness, concordance, plausibility, and currency. They then focused on completeness in their follow-up work (ie, Weiskopf et al [2013b] 20 ). Liaw et al (2013) 21 summarized the most reported dimensions in DQ assessment. Zozus et al (2014) 22 conducted a literature review to identify the DQ dimensions that most affect the capacity of data to support research conclusions. Johnson et al (2015) 23 developed an ontology to define DQ dimensions to enable automated computation of DQ measures. García-de-León-Chocano (2015) 24 described a DQ assessment framework and constructed a set of processes. Kahn et al (2016) 25 developed the "harmonized data quality assessment terminology" that organizes DQ assessment into 3 categories: conformance, completeness, and plausibility. Reimer et al (2016) 26 developed a framework based on the 5 DQ dimensions from Weiskopf et al (2013a), 19 with a focus on longitudinal data repositories. Khare et al (2017) 27 summarized DQ issues and mapped them to the harmonized DQ terms. Smith et al (2017) 28 shared a framework for assessing the DQ of administrative data.
Weiskopf et al (2017) 29 developed a 3×3 DQ assessment guideline, in which they selected 3 core dimensions from the 5 dimensions they defined in Weiskopf et al (2013a), 19 with 3 core DQ constructs for each dimension. Lee et al (2018) 30 modified the dimensions defined in Kahn et al (2016) 25 to support specific research tasks. Feder (2018) 31 examined DQ dimensions for EHR data used in research, building on Weiskopf et al (2013a). 19 Nordo et al (2019) 33 proposed outcome metrics for the use of EHR data, including measures related to DQ. Bloland et al (2019) 34 offered a framework that describes immunization data in terms of 3 key characteristics (ie, data quality, usability, and utilization). Henley-Smith et al (2019) 35 derived a 2-level DQ framework based on Kahn et al (2016). 25 Charnock et al (2019) 36 conducted a systematic review focusing on the importance of accuracy and completeness in secondary use of EHR data.
However, the literature on DQ assessment of EHR data is due for an update, as the latest review article on this topic, Weiskopf et al (2013a), 19 covered the literature before 2012. Further, few studies have assessed the practice of DQ assessment in large clinical networks. Callahan et al (2017) 37 mapped the data checks in 6 clinical networks to their DQ assessment framework, the harmonized data quality assessment terminology by Kahn et al (2016). 25 One of the networks Callahan et al (2017) 37 assessed is the Pediatric Learning Health System (PEDSnet), which also contributes to the national PCORnet, like OneFlorida. Qualls et al (2018), 38 from the PCORnet data coordinating center, presented the existing PCORnet DQ framework (called "data characterization"), which focused on only 3 DQ dimensions: data model conformance, data plausibility, and data completeness, initially with 13 DQ checks. They reported that the data characterization process they put in place has led to improvements in foundational DQ (eg, elimination of conformance errors, a decrease in outliers, and more complete data for key analytic variables). As our OneFlorida network contributes to PCORnet, we participate in the data characterization process, which has evolved significantly since Qualls et al (2018). 38 Thus, our study aims to identify gaps in the existing PCORnet data characterization process. To obtain a more complete picture of DQ dimensions and methods, we first conducted a systematic scoping review of the existing DQ literature related to RWD. Through the scoping review, we organized the existing DQ dimensions as well as the methods used to assess these DQ dimensions. We then reviewed the DQ dimensions and corresponding DQ methods used in the PCORnet data characterization process (8 versions since 2016) to assess the DQ practice in PCORnet and how it has evolved.

MATERIALS AND METHODS
We followed a typical systematic review process to synthesize the relevant literature, extract DQ dimensions and DQ methods, map their relationships, and map them to the PCORnet data checks. Throughout the process, 2 team members (TL and AL) independently carried out the review, extraction, and mapping in each step; disagreements between the 2 reviewers were first resolved through discussion with a third team member (JB) and then by the entire study team if necessary. We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline and generated the PRISMA flow diagram.
A systematic scoping review of data quality assessment literature
We started with 3 widely cited core references on EHR DQ assessment: 2 review articles, from Chan et al (2010) 16 and Weiskopf et al (2013a), 19 and 1 DQ framework, from Kahn et al (2016). 25 First, we summarized and mapped the DQ dimensions in these 3 core references. We merged dimensions that are similar in concept but named differently. For example, Chan et al (2010) 16 defined "data accuracy" as whether the data "can accurately reflect an underlying state of interest," while Weiskopf et al (2013a) 19 termed this "data correctness" (ie, "whether the data is true"). Then we synthesized the methods used to assess these DQ dimensions. Weiskopf et al (2013a) 19 explicitly summarized the DQ assessment methods, while Chan et al (2010) 16 and Kahn et al (2016) 25 did not; for the latter 2, we derived the assessment methods from their definitions and measurement examples. For example, Chan et al (2010) 16 defined "completeness" as "the level of missing data" and discussed various studies that have shown the variation in the amount of missing data across different data areas (eg, problem lists and medication lists) and clinical settings, while Kahn et al (2016) 25 provided examples of how to measure "completeness" (eg, "the encounter ID variable has missing values"). Thus, we mapped "completeness" to the method of checking "element presence" (ie, "whether or not desired data elements are present") defined in Weiskopf et al (2013a). 19 We created new method categories when the measurement examples could not be mapped to the existing methods in Weiskopf et al (2013a). 19 For example, Kahn et al (2016) 25 defined a "conformance" dimension that cannot be mapped to any of the methods defined in Weiskopf et al (2013a); 19 thus, we created a new method term, "conformance check," to assess "whether the values that are present meet syntactic or structural constraints." Kahn et al (2016) 25 gave examples of conformance checks, such as that the variable sex shall only have the values "Male," "Female," or "Unknown." We then reviewed the literature cited in the 3 core references.
Chan et al (2010) 16 and Weiskopf et al (2013a) 19 are literature reviews, while the framework from Kahn et al (2016) 25 is based on 9 other frameworks (the full text of 1 of these frameworks is not available) and the literature review by Weiskopf et al (2013a). 19 For completeness, we extracted the extra dimensions that were mentioned in the 8 available frameworks but not included in the framework from Kahn et al (2016). 25 We also summarized the methods for these additional dimensions according to the measurement examples given in the original frameworks.
We then reviewed the articles that were cited in the 2 core review papers: Chan et al (2010) 16 and Weiskopf et al (2013a). 19 We mapped the dimensions and methods mentioned in these articles to the ones we extracted from Kahn et al (2016). 25 During this process, we revised the definitions of the dimensions and methods to make them more inclusive of the different literature.
Weiskopf et al (2013a) 19 is the latest review article on this topic and covers the DQ literature before January 2012. Thus, we conducted an additional review of the DQ assessment literature published from 2012 to February 2020. We identified 2 groups of search keywords (ie, DQ-related and EHR-related keywords), mainly from the 3 core references. The search strategy, including the keywords, is detailed in the Supplementary Appendix A. An article was included if it assessed the quality of data derived from EHR systems using clearly defined DQ measurements (even if the primary goal of the study was not to assess DQ).
We then extracted the DQ dimensions and methods from these new articles, merged the ones that are similar to the existing ones, and created new dimensions and methods if necessary. After this process, we created a comprehensive list of dimensions, their concise definitions, and the methods commonly used to assess these DQ dimensions.
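To make 2 of the assessment methods discussed above concrete, the following is a minimal hypothetical sketch of an "element presence" check (used to assess completeness) and a "conformance check" against an allowed value set, echoing the sex-variable example from Kahn et al (2016). The field names, record structure, and value set are illustrative only and are not taken from any network's actual specification.

```python
# Hypothetical sketch of 2 DQ assessment methods: "element presence"
# (completeness) and "conformance check" (value conformance).
# Field names and the allowed value set are illustrative only.

ALLOWED_SEX_VALUES = {"Male", "Female", "Unknown"}

def element_presence(records, field):
    """Fraction of records in which the desired data element is present."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) not in (None, ""))
    return present / len(records)

def conformance_violations(records, field, allowed):
    """Records whose value for `field` is present but not in the allowed set."""
    return [r for r in records
            if r.get(field) not in (None, "") and r.get(field) not in allowed]

records = [
    {"encounter_id": "E1", "sex": "Female"},
    {"encounter_id": "E2", "sex": "F"},     # nonconformant code
    {"encounter_id": "E3", "sex": None},    # missing element
]

completeness = element_presence(records, "sex")  # 2 of 3 records present
violations = conformance_violations(records, "sex", ALLOWED_SEX_VALUES)
```

Note that the 2 methods flag different records: the missing value lowers the element-presence score, while only the present-but-invalid code counts as a conformance violation.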
Map the PCORnet data characterization checks to the data quality dimensions and methods
We reviewed the measurements in the PCORnet data checks (from version 1, published in 2016, to version 8, as of 2020) 38,39 and mapped them to the dimensions and methods we summarized above. Two reviewers (TL and AL) independently carried out the mapping tasks, and conflicts were resolved by a third reviewer (JB) through group discussions.

RESULTS

Data quality dimensions and assessment methods summarized from the 3 core references
Data quality dimensions
Overall, we extracted 12 dimensions (ie, currency, correctness/accuracy, plausibility, completeness, concordance, comparability, conformance, flexibility, relevance, usability/ease-of-use, security, and information loss and degradation) from the 3 core references and then mapped the relationships among them.
Chan et al (2010) 16 conducted a systematic review on EHR DQ literature from January 2004 to June 2009 focusing on how DQ affects quality of care measures. They extracted 3 DQ aspects: (1) accuracy, including data currency and granularity; (2) completeness; and (3) comparability.
Weiskopf et al (2013a) 19 performed a literature review of EHR DQ assessment methodology, covering articles published before February 2012. They identified 27 unique DQ terms/dimensions. After merging DQ terms with similar definitions and excluding dimensions that have no measurement (ie, how the DQ dimension is measured), they retained 5 dimensions: (1) completeness, (2) correctness, (3) concordance, (4) plausibility, and (5) currency.
Kahn et al (2016) 25 proposed a DQ assessment framework for secondary use of EHR data, consisting of 3 DQ dimensions: (1) conformance with 3 subcategories: value conformance, relational conformance, and computational conformance; (2) completeness; and (3) plausibility with 3 subcategories: uniqueness plausibility, atemporal plausibility, and temporal plausibility. Each DQ dimension can be assessed in 2 different DQ assessment contexts: verification (ie, "how data values match expectations with respect to metadata constraints, system assumptions, and local knowledge"), and validation (ie, "the alignment of data values with respect to relevant external benchmarks").
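As a hypothetical illustration of a plausibility check in the verification context, temporal plausibility can be assessed against local knowledge within the dataset itself, for example, that no encounter should be dated after the patient's recorded death. The field names and dates below are our own illustrative assumptions, not examples from Kahn et al (2016).

```python
from datetime import date

# Hypothetical temporal plausibility check (verification context):
# an encounter dated after the patient's recorded death is implausible.
# Field names and dates are illustrative only.

def implausible_encounters(encounters, death_dates):
    """Return IDs of encounters dated after the patient's death date."""
    flagged = []
    for enc in encounters:
        died = death_dates.get(enc["patient_id"])
        if died is not None and enc["date"] > died:
            flagged.append(enc["encounter_id"])
    return flagged

death_dates = {"P1": date(2019, 5, 1)}  # P2 has no recorded death date
encounters = [
    {"encounter_id": "E1", "patient_id": "P1", "date": date(2019, 4, 15)},
    {"encounter_id": "E2", "patient_id": "P1", "date": date(2019, 6, 1)},
    {"encounter_id": "E3", "patient_id": "P2", "date": date(2019, 6, 1)},
]

flagged = implausible_encounters(encounters, death_dates)  # ["E2"]
```

The same check performed against an external benchmark (eg, a state death registry) would instead fall under the validation context.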
For comprehensiveness, we also reviewed the 8 DQ frameworks that were cited by Kahn et al (2016) 25 and extracted the additional dimensions (eg, flexibility, relevance, usability/ease-of-use, security, and information loss and degradation) that these frameworks mentioned but that were not covered by the 3 core references.

Data quality assessment methods
A total of 10 DQ assessment methods were identified: 7 from Weiskopf et al (2013a) 19 and 3 (ie, conformance check, qualitative assessment, and security analyses) that we created based on the measurement examples given in the other core references and frameworks. From the articles cited in the 2 core review papers, 16,19 we extracted the DQ measurements used and mapped them to the 12 DQ dimensions and 10 DQ assessment methods. Through this process, we revised the definitions of the DQ dimensions and methods when necessary. Figure 1A shows our review process.
Further, since the review from Weiskopf et al (2013a) 19 only covered the literature before 2012, we conducted an additional review of the literature on EHR DQ assessment published from 2012 up until February 2020. Figure 1B illustrates our literature search process following the PRISMA flow diagram.
Through this process, we identified 1072 publications and then excluded 743 articles through title and abstract screening. During the full-text screening, 172 articles were excluded because (1) the full text was not accessible (n = 19); (2) the paper was not relevant to DQ or lacked sufficient details on what methods were used to assess DQ (n = 147); or (3) the data of interest were not derived from clinical data systems (n = 6). In the end, 157 new articles were included, of which 139 were individual studies and 16 were review articles or frameworks. Four of the 16 review/framework articles were already included in the 3 core references; thus, effectively, we identified 12 new review or framework articles and reviewed 139 new individual DQ assessment studies published from 2012 to February 2020. The list of all reviewed articles is in Supplementary Appendix B.
Review of the newly identified DQ frameworks and review articles
From the 12 newly identified DQ frameworks or reviews, we extracted the DQ dimensions and assessment methods and mapped them to the existing 12 DQ dimensions and 10 methods we extracted from the 3 core references, refining the original definitions when necessary. We did not identify any new DQ methods, but we identified 2 new DQ dimensions: (1) consistency (ie, "pertains to the constancy of the data, at the desired degree of detail for the study purpose, within and across databases and data sets," from Feder [2018] 31 ) and (2) understandability/interpretability (ie, the ease with which a user can understand the data, from Smith et al [2017] 28 ). The consistency dimension from Feder (2018) 31 covers a broader and more abstract concept, pertaining to the constancy (ie, "the quality of being faithful and dependable") of the data.

A summary of DQ dimensions and assessment methods
We summarized the 14 DQ dimensions and 10 DQ assessment methods and mapped the relationships among them, as shown in Figure 2. Following Kahn et al (2016), 25 we categorized the DQ dimensions and methods into 2 contexts: verification (ie, can be assessed using the information within the dataset or using common knowledge) and validation (ie, can be assessed using external resources, such as comparison with external data sources and checks against data standards). However, 6 DQ dimensions (ie, flexibility, relevance, usability, security, information loss and degradation, and understandability/interpretability) and 2 DQ assessment methods (ie, qualitative assessment and security analyses) cannot be categorized into either context. In the broader DQ literature, there is also the concept of intrinsic versus extrinsic DQ. 14,40 Intrinsic DQ denotes that "data have quality in its own right" 14 and is "independent of the context in which data is produced and used," 40 while extrinsic DQ, although not explicitly defined, is more sensitive to the external environment, considering the context of the task at hand (ie, contextual DQ 40 ) and the information systems that store and deliver the data (ie, accessibility DQ and representational DQ 40 ). In our context, D1-D7 are more related to intrinsic DQ, while D8-D14 may fall into the extrinsic DQ category. Note that there is also literature that defines intrinsic versus extrinsic DQ in terms of how they can be assessed (ie, "this measure is called intrinsic if it does not require any additional data besides the dataset, otherwise it is called extrinsic" 41 ); however, such definitions may be incomplete and imprecise. For example, correctness/accuracy (D2) is part of the intrinsic DQ defined in Strong et al (1997) 40 but can be assessed with external datasets in the context of validation. Tables 1 and 2 show the definitions of the DQ dimensions and DQ methods, respectively, and the reference frameworks or reviews from which we extracted the definitions.
Table 3 shows the result of mapping existing PCORnet data characterization checks to the 14 DQ dimensions and 10 DQ assessment methods.

DISCUSSION
Evident from the large number of studies we identified (3 review articles, 20 DQ frameworks, and 226 relevant DQ studies), the literature on the quality of real-world clinical data, such as EHRs and claims, for secondary research use is rich. Nevertheless, the definitions of, and the relationships among, the different DQ dimensions are not as clear as they could be. For example, even though we merged accuracy and correctness into 1 DQ dimension, accuracy/correctness (D2), the original accuracy dimension (ie, "the extent to which data accurately reflects an underlying state of interest," including timeliness and granularity) as defined by Chan et al (2010) 16 actually contains both correctness (ie, "data were considered correct when the information they contained was true") and plausibility (ie, "actual values as a representation of a real-world" state). As another example, the validity check defined in Weiskopf et al (2013a) 19 refers to comparing a data element to an external authoritative resource (eg, comparing the prevalence of diabetes patients calculated from an EHR system to the general diabetes prevalence of that area), while the validity check defined in Kahn et al (2016) 25 refers to whether the value of a data element is out of the normal range (ie, outliers).

However, the practice of DQ assessment was not evenly distributed across these dimensions and methods, and the reasons may be multifold. First, the data from different sites of a CRN are heterogeneous in syntax (eg, file formats), schema (eg, data models and structures), and even semantics (eg, meanings or interpretations of the variables). This is not only because of the differences between EHR vendors (eg, Cerner vs Epic), but also because of differences in implementations of the same EHR vendor's system. For example, Epic's flexibility in being able to create arbitrary flowsheets to meet different use cases also creates inconsistency in data capture at the data sources. Common data models (CDMs) and common data elements are common approaches to addressing these inconsistencies by transforming the source data into an interoperable common data framework; data in PCORnet follow the PCORnet CDM, and both the PCORnet CDM and the PCORnet data check specifications are available at https://pcornet.org/data-driven-common-model/. However, it is worth noting that standardization and harmonization of heterogeneous data sources are always difficult after the fact, when the data have already been collected. For example, in the OneFlorida network, although partners are required to provide a data dictionary of their source data, the units of measure are often neglected by the partners, leading to situations such as the average heights of patients being vastly higher than is plausible. Our investigation of this DQ issue revealed that certain partners used centimeters rather than inches (as dictated by the PCORnet CDM) as the unit of measure. These "human" errors are inevitable, and a rigorous DQ assessment process is critical for identifying such issues. Second, even though DQ is widely recognized as an important aspect, it is difficult to have a comprehensive process that captures all DQ issues from the get-go.
The approach that PCORnet takes is to have different levels of DQ assessment processes, where general data checks (as shown in Table 3) are used to capture common and easy-to-catch errors, while a study-specific data characterization process is used to determine whether the data at hand can inform a study's specific objectives. Third, some DQ dimensions and DQ methods, although easy to understand in concept, are difficult to put in place and execute in reality. For example, usability/ease-of-use (D10) and security (D11), although straightforward to understand, lack well-defined executable measures. These DQ dimensions are nevertheless important aspects of DQ, and more efforts on methods and tools to assess DQ dimensions such as flexibility (D8), usability/ease-of-use (D10), security (D11), and understandability/interpretability (D14) are needed to fill these knowledge gaps. There are also a few studies 21,23 that attempted to develop ontologies of DQ to "enable automated computation of data quality measures" and to "make data validation more common and reproducible." However, these efforts, although much needed, have not led to wide adoption. The "harmonized data quality assessment terminology" proposed by Kahn et al (2016), 25 although not comprehensive, covers common and important aspects that matter in DQ assessment practice; further expansion is warranted. Another interesting observation is that, out of the 226 DQ assessment studies, only 1 study 42 discussed the importance of DQ assessment reports. It recommends, and we agree, that "reporting on both general and analysis-specific data quality features" is critical to ensure transparency and consistency in computing, reporting, and comparing the DQ of different datasets. These aspects of DQ assessment also deserve further investigation.
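Distribution-level checks of the kind that surfaced the centimeters-versus-inches height issue described above can be automated with very little code. The following is a minimal hypothetical sketch; the expected range, tolerance, and site values are our own illustrative assumptions, not PCORnet's actual thresholds.

```python
# Hypothetical distribution-level check for unit-of-measure errors:
# adult heights recorded in centimeters look like extreme outliers
# when inches are expected. Range, tolerance, and data are illustrative.

EXPECTED_HEIGHT_RANGE_INCHES = (48.0, 84.0)

def suspect_unit_mismatch(heights, expected_range, tolerance=0.1):
    """Flag a site's height column if the share of values outside the
    expected range exceeds the tolerance (eg, cm submitted as inches)."""
    if not heights:
        return False
    low, high = expected_range
    outliers = sum(1 for h in heights if not (low <= h <= high))
    return (outliers / len(heights)) > tolerance

site_a = [62.0, 70.5, 68.0, 64.2]      # plausible heights in inches
site_b = [157.5, 179.1, 172.7, 163.1]  # centimeters submitted by mistake

flag_a = suspect_unit_mismatch(site_a, EXPECTED_HEIGHT_RANGE_INCHES)  # False
flag_b = suspect_unit_mismatch(site_b, EXPECTED_HEIGHT_RANGE_INCHES)  # True
```

A check like this only flags a site for human review; deciding whether the cause is a unit error, a schema mismatch, or genuinely unusual data still requires the kind of investigation described above.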

LIMITATIONS
First, we used only PubMed to search for relevant articles; thus, we may have missed some potentially relevant studies indexed in other databases (eg, Web of Science). Second, our review focused on qualitatively synthesizing DQ dimensions and DQ assessment methods but did not go into detail about how these DQ dimensions and methods can be applied. Further comprehensive investigation of which DQ checks and measures are concrete and executable is also warranted.

CONCLUSIONS
Our review highlights the wide awareness and recognition of DQ issues in RWD, especially EHR data. Although the practice of DQ assessment in exists, it is still limited in scope. With the rapid adoption and increasing promotion of research using RWD, DQ issues will be increasingly important and call for attention from the research communities. However, different strategies of DQ may be needed given the complex and heterogeneous nature of RWD. DQ issues should not be treated alone but rather in full consideration with other data-related issues, such as selection bias among others. The addition of reporting DQ into the now widely recognized FAIR (ie, Findability, Accessibility, Interoperability, and Reuse) data principles may benefit the broader research community. Nevertheless, future work is warranted to generate understandable, executable, and reusable DQ measures and their associated assessments.