Electronic health records (EHRs), data generated and collected during normal clinical care, are increasingly being linked and used for translational cardiovascular disease research. Electronic health record data can be structured (e.g. coded diagnoses) or unstructured (e.g. clinical notes) and increasingly encapsulate medical imaging, genomic and patient-generated information. Large-scale EHR linkages enable researchers to conduct high-resolution observational and interventional clinical research at an unprecedented scale. A significant amount of preparatory work and research, however, is required to identify, obtain, and transform raw EHR data into research-ready variables that can be statistically analysed. This study critically reviews the opportunities and challenges that EHR data present in the field of cardiovascular disease clinical research and provides a series of recommendations for advancing and facilitating EHR research.
Over the past few years, ‘big data’ has become a frequently used catch-all phrase for research approaches involving the use of complex, large-scale datasets.1,2 There are many types of data that may fit this description, but within the sphere of clinically oriented research the term is often considered synonymous with electronic health record (EHR) data, or electronic medical record data. The powerful potential of these data for advancing biomedical research has been recognized by many.3–5 Funders in both the USA and the UK have recently made substantial investments in the area, such as the USD$215 million Precision Medicine Initiative6 announced by the US government and Genomics England, which aims to sequence 100 000 whole genomes during routine clinical care.7 Additionally, funding organizations are actively encouraging research utilizing large-scale biomedical data through specific initiatives. The Big Data To Knowledge programme8 was established by the National Institutes of Health (NIH) in 2012 to address the challenges and opportunities presented by big biomedical data through the provision of seed funding for biomedical data science-based research, methods, and training material development. In the UK, a consortium of 10 UK government and charity funders, led by the Medical Research Council, has committed over £90 million across several initiatives aimed at supporting translational research using big data, such as the national Farr Institute of Health Informatics Research,9 the UK Health Informatics Research Network, and the Medical Bioinformatics Initiative. The amount of EHR data being digitally generated and collected is vast and rapidly expanding, and presents multiple opportunities that have the potential to transform medical practice and cardiovascular research across all stages of translation.
However, big data is not a panacea for all research problems, and for many researchers the path from big data to clinical impact for a specific research question is unclear. There are many factors that must be considered when planning to use EHR data for research, relating not just to the ethical and policy issues raised by combining data sources10,11 but also to the logistical and analytical decisions the process entails. One of the major impediments to the use of EHRs for research is that the data they contain differ from data collected in a conventional cohort study or randomized controlled trial (RCT) in terms of both why and how they are recorded, and they require substantial processing before they can be statistically analysed. These data are generated and recorded throughout the patient pathway during interactions with primary, secondary, and tertiary healthcare providers. Data from specialized disease registries, which were originally set up for auditing clinical standards and benchmarking quality improvement initiatives, may also be incorporated. These different sources also record information in different ways. Electronic health record data can be structured [e.g. diagnoses recorded using medical classification systems such as the International Classification of Diseases—10th revision (ICD-10)12 or the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT13)] or unstructured (e.g. textual narrative in clinical notes or coronary angiography reports in hospital information systems14). Electronic health records are also increasingly including cardiovascular imaging data from procedures such as echocardiography, angiography, magnetic resonance imaging, or computed tomography.15 For all sources of information, data collection will have been motivated by clinical care, administrative, or other reasons, and clinical information will be recorded in a variety of ways.
The research-user is faced with substantial missing or incomplete information, data collected at irregular time-points, information that may be temporally inconsistent or conflicting, and potentially the task of integrating and harmonizing information contained in multiple sources.
These challenges are not insurmountable and do not mean that EHR data cannot be widely used for research, but do require a clear identification of research areas that can best leverage EHR data, and the development of tools that smooth the path from research question to research result.
Research opportunities well-placed to leverage electronic health record data
High-resolution observational cohort studies
Linkage of multiple EHR data sources permits the creation of large-scale cohorts of patients for whom extensive follow-up data are already available. This allows researchers to answer questions that would otherwise be impossible to address with traditional investigator-led cohort studies due to their scale, diagnostic resolution, timeframe, or cost. In addition, it allows researchers to define and examine the entire patient journey, from early presentations of non-acute manifestations through the various syndrome transitions to cardiac (or non-cardiac) death. This enables them to resolve the time sequence and to examine and understand the aetiological and prognostic differences between different coronary disease phenotypes.16
Chung et al.17 were able to take advantage of available EHR data in this way to conduct a comparative effectiveness study of acute coronary care on an international scale. Currently, Sweden and the UK are the only two countries in the world with ongoing, national registries for acute coronary syndrome events that cover all hospital care. Using these data, the authors showed that in the UK, compared with Sweden, 30-day mortality following acute myocardial infarction (AMI) was substantially higher, and that uptake of effective treatment was slower. The richness of the data meant that a substantial amount of clinical information could be incorporated into the casemix, including demography, risk factors, comorbidity, and pre-hospital treatment. The researchers were also able to determine that diagnoses made in the two countries were comparable by examining troponin values and propensity to make a diagnosis. The results from this study are thus more robust than those based on a simple comparison of mortality rates, or those focused on data from bespoke studies undertaken in hospitals that may not be representative of the broader healthcare system.
Electronic health record cohorts can also be used to make timely contributions to debates of clinical importance, such as the controversy over the relationship between varenicline and adverse cardiovascular events. In 2011, a meta-analysis of 14 RCTs raised concerns that the use of varenicline for smoking cessation may increase risk for adverse cardiovascular events (ischaemia, arrhythmia, congestive heart failure, sudden death, or cardiovascular-related death).18 Three subsequent meta-analyses of RCTs did not find a significant association.19–21 However, the question remains controversial, partly due to disagreements over analytical methods used in these studies, but also because meta-analyses are limited to the analysis of existing studies.20,22,23 Svanström et al.24 were able to rapidly contribute new data to the debate by investigating the question in a cohort made up of the EHR data of over 35 000 Danish individuals who used either varenicline or bupropion for smoking cessation. In this observational study, published in 2012, there was no evidence for a higher number of adverse vascular events (acute coronary syndrome, ischaemic stroke, and cardiovascular death) in patients using varenicline. It would not have been feasible to take a comparable traditional cohort study from study design to publication within a similar timeframe, especially as a very large number of patients would be required to ensure sufficient outcome numbers (only 117 were observed among the 35 000 patients in the EHR study).
The capacity to investigate novel research questions has generally been limited by available data and funding for obtaining new data, but EHR data can potentially be used to address this problem. The relationship between auto-immune inflammatory conditions and atrial fibrillation (AF) is one example where EHR data have been able to fill a research niche. Although there is substantial research interest in this area,25 many of the large cardiovascular cohort studies (e.g. Framingham26) have limited data available on inflammatory conditions as this was not part of the original study design. However, researchers in the UK, USA, and Denmark have been able to use EHR resources to explore this research area using very large samples, finding associations between an increased risk of AF and a range of conditions including rheumatoid arthritis and psoriasis.27–30 Other researchers have taken an even broader, non-hypothesis-driven approach, using advanced computational techniques that consider any and all disease information available in EHR data to identify novel associations between diseases.31 The costs associated with using EHR data for these studies would have been much lower than comparable data collection, making them a cost-effective entry point into new areas of cardiovascular research.
Enhanced clinical trials
There is growing concern that the current model of discovering new interventions, evaluating them through RCTs, and implementing them in clinical care is significantly inefficient. The translation process itself is taking too long, with an average figure of 17 years reported in some cases.32 Additionally, the number of new drugs introduced to the market per year has been broadly flat since the 1950s yet the costs have steadily grown,33 and the cost of bringing a new licensed drug to the market has been estimated at between USD$5 and 11 billion.34
In cardiovascular diseases (CVD), these inefficiencies manifest acutely in the current clinical trials pipeline. There is a lack of contemporary and representative population data that can be utilized to draw accurate estimates of event rates and inform the selection of appropriate RCT primary and secondary endpoints. Clinical trials are often conducted in highly selected populations that are not necessarily representative of the populations presenting in routine clinical care, and as such the results obtained have limited generalizability and external validity.35 For example, the clinical characteristics, treatments, and inpatient outcomes of patients enrolled in a large trial of acute heart failure (Acute Study of Nesiritide in Decompensated Heart Failure) were found to be significantly different from those found in a contemporary disease registry.36 Furthermore, despite their growing importance in CVD research, non-drug interventions such as those based on clinical algorithms and decision support tools are not systematically evaluated through clinical trials, since the processes of randomization and outcome ascertainment are not seamlessly integrated into the clinical care pathway.
This has had a significantly negative impact on clinical trial conduct and findings. For example, there have recently been several late drug failures within phase III clinical trials of therapeutic agents, each costing several hundred million US dollars. HDL-cholesterol-raising agents such as niacin, fibrates, and cholesteryl ester transfer protein inhibitors have failed to reduce all-cause mortality, coronary heart disease mortality, and myocardial infarction event rates in patients treated with statins.37 Likewise, heart rate-lowering agents such as ivabradine, when given to patients with stable coronary artery disease without clinical heart failure, failed to improve rates of cardiovascular mortality and non-fatal myocardial infarction.38
There is growing optimism that EHRs can enrich RCT design, delivery, and follow-up. Electronic health records can offer real-world, phenotype-rich data that can directly inform trial design, enable the identification of optimal target populations, and offer accurate event rate estimates similar to those encountered in clinical care. The entire trial conduct pipeline, from recruitment at the point of care to randomization and adverse event capture, can be integrated with routine clinical care, enabling the cost-effective and efficient trialling of non-drug interventions. Additionally, EHRs can provide richer contemporary data on trial participants at a fraction of the cost, thus enabling the generalization of trial results to external populations.39
For example, the Thrombus Aspiration during ST-Segment Elevation Myocardial Infarction trial,40 which assessed the clinical effect of routine intracoronary thrombus aspiration before primary percutaneous coronary intervention in patients with ST-segment elevation myocardial infarction, recruited patients through the Swedish Coronary Angiography and Angioplasty Registry and utilized national EHR and registry data to define trial endpoints. Finally, EHR data can provide valid, complete, long-term follow-up of phase III trials that would otherwise be too costly and complex to establish and too narrow in focus.41 While EHRs offer a rich data scaffolding for designing and implementing clinical trials, significant challenges still exist, mainly around information governance and the recruitment of clinicians, as outlined in the evaluation by van Staa et al.42,43
Challenges in the pathway from electronic health record data to research results
Although the benefits of using EHR data for research are potentially large, the widespread use of EHR data is hampered by the fact that there are currently a number of additional steps, and many associated queries, in the pathway from research question to results and publication. As an example, consider a research project using existing data to investigate whether there is a relationship between gender and onset of AF. Conventionally, such a project would involve applying standard analytical techniques to a bespoke investigator-led cohort of healthy individuals followed up for cardiovascular conditions including AF (e.g. the Framingham Heart Study). For an existing dataset, only relatively minimal data preparation would be required before analyses could be conducted, and data are often provided with detailed documentation. However, using EHR data to answer the same question would require a number of additional preparatory steps before statistical analyses could be conducted. Broadly speaking, these relate to: (i) identifying the EHR source(s) that contain the data needed for the research question; (ii) developing strategies for extracting the required information from the data source(s), and combining it where necessary; and (iii) creating a dataset that is ready for analysis using standard statistical techniques (see Figure 1).
What electronic health record data sources are available?
The availability of diverse data sources, including EHR, is rapidly expanding, making the identification of relevant sources for a single project overwhelming. Selecting appropriate data sources for research is dependent upon knowledge of the patients included (e.g. inpatients, ambulatory care, and specialist treatment), the types of data recorded (e.g. diagnosis, prescriptions, test results, and procedures), and the format of those data (e.g. diagnostic codes, imaging, and free text), but often much of this can be difficult to determine in detail until data access has been granted. A recent Wellcome Trust report on the discoverability of EHR and other biomedical datasets for research44 found that for the vast majority of sources, no systematic method is used to capture, curate, and display information about the data contained in them, or to provide guidance on the information governance restrictions attached to them which determine how they can be accessed and used for research. The limited use of standardized methods (e.g. metadata) for describing such information hinders recognition of the limitations and opportunities these data sources present, and potentially results in under-utilization of data sources due to lack of knowledge about what they contain (Box 1).
Electronic health records. Electronic health records are data generated and recorded during routine clinical care. Electronic health records are diverse and encompass nationally and regionally available structured and unstructured data from primary care, hospitals, administrative data, and disease, procedure and death registries; increasingly including genomic, imaging, and patient sensor data.
Medical ontology. A structured controlled vocabulary of medical concepts and their semantic relations used to record, store, and transmit medical knowledge and patient-related clinical information efficiently.
Metadata. Metadata are data that describe aspects of a particular data element. For an EHR source, metadata can include information about the manner in which the data are generated and recorded, the medical ontologies used to record information, and the methods by which researchers can access the data for research.
Phenotyping. In the context of EHR, phenotyping is defined as the process of creating algorithms that define an observable trait (physical or biochemical) such as a clinical condition within EHR data.
However, overcoming this challenge is worth the additional effort, as combining data from multiple sources strengthens EHR-based cardiovascular research. For example, Herrett et al. explored the completeness of recording for AMI in four EHR sources: primary care (Clinical Practice Research Datalink; CPRD45), hospital admissions (Hospital Episode Statistics; HES46), a national MI disease registry (Myocardial Ischaemia National Audit Project47), and national mortality data (Office of National Statistics; ONS48). Compared with the disease registry, which was treated as the gold standard data source, none of the other data sources captured all MI events, and consequently incidence rates based on data from a single source were underestimated by 25–50%.49 This finding is not limited to AMI; a similar investigation of AF diagnoses found that only ∼40% of the 72 793 AF patients identified had a diagnosis recorded in both primary and secondary care.50
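The completeness comparison described above amounts to measuring, for each source, the fraction of all known events it captures. A minimal sketch follows; the patient identifiers and event counts are entirely hypothetical, not drawn from the Herrett et al. study:

```python
# Sketch: quantifying event capture across linked EHR sources.
# Patient identifiers are hypothetical; each set holds the IDs of
# patients with an MI event recorded in that source.
primary_care = {"p01", "p02", "p05", "p07"}
hospital = {"p02", "p03", "p05", "p06"}
registry = {"p01", "p02", "p03", "p04", "p05", "p06"}

# The union across all linked sources approximates the full set of events;
# note that even the registry misses p07, recorded only in primary care.
all_events = primary_care | hospital | registry

def capture_rate(source, reference):
    """Fraction of reference events also recorded in a single source."""
    return len(source & reference) / len(reference)

for name, source in [("primary care", primary_care),
                     ("hospital", hospital),
                     ("registry", registry)]:
    print(f"{name}: {capture_rate(source, all_events):.0%} of linked events captured")
```

In a real linkage study the reference would be the gold standard registry rather than the union, but the set arithmetic is the same.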
Thus, for our example research question regarding gender and AF, we would likely decide to combine multiple EHR data sources, such as CPRD, HES, and ONS. This would enable us to use a sample of individuals broadly representative of the UK general population, and would include a more representative set of AF cases as diagnoses made in both primary and secondary care would be identified. However, individual access applications would need to be made for each EHR source prior to linkage of the different data sources, and information about what is contained within each would currently be limited to knowledge of the medical ontologies used to code clinical information.
How can I define clinical conditions in electronic health record data?
Once the relevant data source(s) have been identified, researchers face another challenge: how to determine which patients have been diagnosed with a particular condition. Extracting phenotypic information (i.e. disease status), a process known as phenotyping, is a time-consuming and challenging task even in relation to a single data source, as multiple diagnosis codes may be used to describe similar or related conditions. This challenge is amplified when data from multiple sources, recorded using different medical ontologies, are combined. Figure 2 illustrates this, using as an example data for one individual from the three EHR sources in our hypothetical research question. In this example, an AF diagnosis is recorded at three different time-points: as a secondary diagnosis during a hospital admission, in the primary care record after hospital admission information is transferred to their general practitioner, and as a primary diagnosis when the patient is admitted to hospital for an AF-related surgical procedure. This information needs to be reconciled in order to determine not only if, but also when, a diagnosis occurred.
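One common reconciliation rule takes the earliest recorded date across all sources as the best estimate of diagnosis onset. The sketch below illustrates this under that assumption; the source names, field names, and dates are illustrative, not drawn from any real record:

```python
from datetime import date

# Hypothetical AF records for one individual, drawn from linked sources.
af_records = [
    {"source": "hospital", "position": "secondary", "date": date(2010, 3, 14)},
    {"source": "primary_care", "position": "primary", "date": date(2010, 4, 2)},
    {"source": "hospital", "position": "primary", "date": date(2012, 9, 30)},
]

def earliest_diagnosis(records):
    """Reconcile duplicate diagnoses by taking the earliest recorded date
    as the best estimate of when the condition first occurred."""
    return min(r["date"] for r in records)

print(earliest_diagnosis(af_records))  # the first hospital admission date
```

More sophisticated rules might weight sources by validity (e.g. prefer registry-confirmed diagnoses), but the earliest-date heuristic is a reasonable starting point for incident-disease analyses.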
Reconciling coded information from multiple sources is made more challenging by the different medical ontologies that are used by each source. For example, in the UK, primary care sources use Read codes, a subset of the SNOMED-CT clinical terminology, whereas secondary care and mortality sources use the ICD-10. Combining data recorded using these systems for a single condition, such as AF, is not straightforward as the clinical resolution they offer can vary substantially; there are 23 Read codes relating to AF, including disease subtype classification, but only 1 ICD-10 code. Data-driven computational methodologies, such as support vector machines (SVMs), can be applied to unstructured data [e.g. clinical text, electrocardiographic (ECG) monitoring data] to further enhance and fine-tune the accuracy of algorithms utilizing coded data.4,51,52 For example, Mohebbi and Ghassemian53 created an algorithm that consists of a linear discriminant analysis-based feature reduction scheme and an SVM-based classifier, and were able to accurately (sensitivity 99.07%, specificity 100%, and positive predictive value 100%) detect AF cases using RR intervals extracted from ECG signals.
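The coded-data side of this reconciliation is essentially a lookup: each ontology's AF code list maps onto a single harmonized phenotype flag. The sketch below illustrates the idea; the Read codes shown are a small illustrative subset (real, validated code lists such as those in CALIBER are far more extensive), and real ICD-10 data may also carry I48 subcodes:

```python
# Harmonized AF code lists, one per ontology. These entries are
# illustrative only and are NOT a validated phenotype code list.
AF_CODE_LISTS = {
    "read": {"G573.00", "G573100", "G573z00"},  # subset of the ~23 AF Read codes
    "icd10": {"I48"},                           # the single ICD-10 AF code
}

def has_af(coded_events):
    """Flag a patient as an AF case if any recorded (ontology, code)
    pair matches the corresponding harmonized code list."""
    return any(code in AF_CODE_LISTS.get(ontology, set())
               for ontology, code in coded_events)

patient = [("icd10", "I25.1"), ("read", "G573.00")]
print(has_af(patient))  # True: the Read code matches the AF list
```

Maintaining such code lists per ontology, rather than translating codes between ontologies on the fly, keeps the phenotype definition explicit, reviewable, and shareable.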
No standardized methodologies and mechanisms exist to help research-users define, share, and evaluate EHR-derived phenotypes in a consistent way, or to apply algorithms for creating these phenotypes to their own data, although development of tools for this is very active.54–56 The USA-based Electronic Medical Records and Genomics (eMERGE) Consortium has developed an AF phenotype algorithm57 that focuses on clinical notes and electrocardiogram impression data. These data are not available in CPRD, HES, or ONS, although there is a UK-oriented EHR phenotype resource called CALIBER that does contain an AF phenotype based on coded data from primary and secondary care,50 which could be applied in this situation. However, if no phenotype algorithm existed, we would need to go through the process of developing a new phenotype algorithm for AF, and we would need to repeat this process for every other variable we wanted to include in our final dataset such as gender and any covariates such as other CVD, smoking status, or hypertension.
Validation, preferably against a gold standard, is a key step in defining disease phenotyping algorithms.58 The goal of the validation exercise is to evaluate the accuracy of the algorithm: is the phenotyping algorithm including all eligible patients and excluding all ineligible patients, thus accurately allocating them to the case and control groups? Some phenotypes, such as type 2 diabetes,59 are inherently complex as they make use of multiple data elements (e.g. diagnostic codes, medication information, laboratory measurements, and clinical text) and should ideally be validated through manual review of case notes in primary or secondary healthcare providers in order to understand the information the physician had available at the time of diagnosis. Clinical notes, however, are not available at scale due to information governance restrictions, and scaling this process to large cohorts of patients is challenging and time-consuming. An alternative approach is to validate the developed phenotyping algorithms by conducting epidemiological analyses of the association between known risk factors and the phenotype in question, and comparing these with associations found in other studies. For other phenotypes, such as white blood cell count, the goal of the validation exercise is to ensure that the algorithm includes all eligible patients and discards outliers and incorrect values.
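When a gold standard (e.g. manual case-note review) is available, the validation exercise reduces to comparing algorithm labels against gold-standard labels and reporting sensitivity, specificity, and positive predictive value, the metrics quoted for the AF detection study above. A minimal sketch with synthetic labels:

```python
# Sketch: validating a phenotyping algorithm against a gold standard.
# The labels below are synthetic, purely for illustration.
def validation_metrics(algorithm_labels, gold_labels):
    """Compare algorithm case/non-case labels with gold-standard labels
    and return sensitivity, specificity, and positive predictive value."""
    pairs = list(zip(algorithm_labels, gold_labels))
    tp = sum(1 for a, g in pairs if a and g)          # true positives
    tn = sum(1 for a, g in pairs if not a and not g)  # true negatives
    fp = sum(1 for a, g in pairs if a and not g)      # false positives
    fn = sum(1 for a, g in pairs if not a and g)      # false negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }

algorithm = [True, True, False, True, False, False]
gold      = [True, True, False, False, False, False]
print(validation_metrics(algorithm, gold))
```

In practice the gold-standard review is done on a sample, so confidence intervals around these estimates should also be reported.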
How do I create a research-ready electronic health record dataset?
The process of applying phenotype algorithms to raw EHR data and creating a dataset that is ready to be statistically analysed requires several data transformations that are challenging due to data heterogeneity and complexity. Description of the process is rarely provided as part of academic outputs, and there is increasing recognition of the weaknesses that pervade the current landscape of EHR research in relation to sharing and standardization of data transformation methods.60 The prevalent scientific culture does not promote or reward sharing of standardized and re-usable data transformation libraries, which leads to substantial duplication of effort and increases the potential for a lack of reproducible results from EHR-based studies.
As for a conventional study, an EHR-based study requires a clear definition including the population from which individuals are sampled, inclusion and exclusion criteria, follow-up, and handling of missing data. For our example question, we may need to specify the age range of our patients, whether we are including individuals with prior cardiovascular conditions such as heart failure, and how missing data were handled, but there is additional information that should be reported for EHR data including: the data sources included, the end date of our follow-up data, whether there are exclusion/inclusion criteria based on data quality or other administrative information, details of new phenotype algorithms, and how data were multiply imputed if applicable. While this information can be described to some extent in the Methods section of a scientific paper, the associated computational manipulation and analyses are not standardized for EHR data, and there is currently no provision in scientific papers for detailed explanations of these methods or distribution of associated phenotype algorithms, computer software, or statistical/programming scripts.
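The study-definition step described above is, computationally, the application of explicit inclusion/exclusion rules to the linked records. Encoding those rules as code, rather than prose alone, makes them shareable and reproducible. A sketch under hypothetical field names and cut-offs:

```python
# Sketch: deriving an analysis-ready cohort from linked EHR records.
# Field names, patients, and eligibility cut-offs are hypothetical.
STUDY_END_YEAR = 2015  # end of available follow-up data

patients = [
    {"id": "p01", "age_at_entry": 54, "prior_hf": False, "entry_year": 2006},
    {"id": "p02", "age_at_entry": 34, "prior_hf": False, "entry_year": 2010},
    {"id": "p03", "age_at_entry": 61, "prior_hf": True,  "entry_year": 2008},
]

def eligible(p):
    """Example study definition: aged 40-90 at entry, no prior heart
    failure, and entry before the end of follow-up."""
    return (40 <= p["age_at_entry"] <= 90
            and not p["prior_hf"]
            and p["entry_year"] < STUDY_END_YEAR)

cohort = [p for p in patients if eligible(p)]
print([p["id"] for p in cohort])  # only p01 meets every criterion
```

Distributing such a script alongside the paper would document exactly how the research-ready dataset was derived, complementing the narrative Methods section.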
Recommendations for advancing electronic health record research
Many countries in Europe, and internationally, have EHR systems that could be utilized for research; national, centralized resources that facilitate the steps from research question to research dataset would substantially enhance the research potential of these data sources. Initiatives are already underway to achieve this in some countries, but few tackle all aspects of this process.
The UK-based CALIBER platform61 combines a repository of EHR phenotypes with curated record linkages combining primary care (Clinical Practice Research Datalink), hospital discharge (Hospital Episode Statistics), disease registry (Myocardial Ischaemia National Audit Project47), and death registry (Office of National Statistics) data in over 2 million adults with 10 million person years of follow-up. However, this resource does not provide any tools for bidirectional interactions with EHR data sources. In contrast, the Clinical Record Interactive Search system (based at the NIHR Mental Health Biomedical Research Centre and Dementia Unit at the South London and Maudsley NHS Foundation Trust) allows researchers to investigate anonymized secondary care data, including clinical notes and other text, via novel user-friendly tools that facilitate identification of patients meeting certain criteria and development of text-mining algorithms.62 Finally, the eMERGE Network,54 a US National Human Genome Research Institute-funded consortium, combines a phenotype repository with EHR data from multiple secondary healthcare providers, including imaging and text, linked to genotypic data for all participants.
National EHR portals could combine the strengths of all these projects by including: (i) a national catalogue of contemporary EHR sources curated using metadata standards; (ii) an interactive thesaurus of EHR-derived phenotype algorithms; (iii) standards-driven tools that will enable researchers to visually create observational and interventional research studies (population, inclusion/exclusion criteria, sources, phenotypes). The national catalogue should support the harvesting and integration of metadata from external sources, and manual curation by researchers within a standardized and reproducible framework, as well as providing guidance on data access and content. This will allow users to identify data sources that can provide information both within and across disease areas. The EHR phenotype algorithms and dataset creation tools need to be implemented in a fashion that supports reuse and modification by other users, as well as appropriate academic credit and/or citation. Creating this type of resource will help to foster an ‘open source’ approach to EHR research in which researchers can collaborate and learn from each other, and this will ultimately produce a greater advance in EHR research than could be achieved by any research group in isolation.