Stephanie Garies, Richard Birtwhistle, Neil Drummond, John Queenan, Tyler Williamson, Data Resource Profile: National electronic medical record data from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN), International Journal of Epidemiology, Volume 46, Issue 4, August 2017, Pages 1091–1092f, https://doi.org/10.1093/ije/dyw248
Data resource basics
Canada’s primary health care system
The Canadian health care system provides universal, publicly funded health services to its population. Coverage includes visits to primary care, specialists, hospitalization and, in some provinces, universal drug coverage. Health care is funded and managed within each provincial and territorial jurisdiction, under the oversight of the Canada Health Act and with financial assistance from the federal government.
Primary care is typically the first point of contact in the health care system and is responsible for much of the prevention, diagnosis and management of chronic disease in the community. Among high-income countries, Canada has one of the lowest rates of electronic medical record (EMR) use among family physicians.1 As a result, Canada largely relies on administrative databases (e.g. physician billing claims, hospital discharge abstracts) and surveys to answer questions about primary health care. Many countries with a longer history of EMR use, such as the UK and the Netherlands, have successfully harnessed these data for research and surveillance for many decades. In recent years, Canada has been rapidly catching up with its international peers in the use of EMR systems, and the opportunity to use these data now exists. The detailed clinical information found in EMR data can provide a more comprehensive and contextually relevant perspective of primary care activities in Canada, which is useful for policy decision making, disease surveillance, health services research and clinical quality improvement.
The Canadian Primary Care Sentinel Surveillance Network (CPCSSN)
CPCSSN is the first, largest and only pan-Canadian primary care EMR database in the country. It began development in 2005 for the purposes of establishing a data source to support primary care research and national chronic disease surveillance, though increasingly, clinicians have found these data useful for improving patient care and practice efficiencies.
CPCSSN is organized as a ‘network of networks’, in which existing primary care practice-based research networks (PC-PBRN) across the country (Figure 1) unite to contribute de-identified patient data from their participating family physicians and nurse practitioners practising in full-service, community-based primary care clinics, who thus become CPCSSN ‘sentinels’. Recently, some PC-PBRNs have expanded recruitment to include community paediatricians, though currently this represents a small proportion of the data holdings. As data custodians, sentinels consent on behalf of their patients to participate in CPCSSN and all patients are therefore included in the database unless they have specifically opted out.2 Each province operates under this implied consent model with the exception of Quebec, where health legislation requires patients to give consent individually.3 Participating practices are given patient information posters and brochures to display within the clinic. These inform patients about the CPCSSN project and provide information about opting out.
CPCSSN has developed extraction processes to capture data from 15 different EMR systems across the country (Appendix 1, available as Supplementary data at IJE online). The routine extraction of de-identified patient data is performed remotely in order to minimize disruption to the participating clinics. Once the relevant data are extracted, they are transferred to a local server specific to each contributing PC-PBRN, where cleaning and coding algorithms are applied and the data are standardized into a common format. The data from each network are then merged into the national database. Both local and national databases are held on CPCSSN servers within the Centre for Advanced Computing4 at Queen’s University in Kingston, ON, Canada.
The comprehensiveness and variability of EMR data necessitate case definitions designed specifically for this data source. Although most Canadian primary care EMR products use the International Classification of Diseases, Ninth Revision (ICD-9) to code diagnoses, free text is also frequently used, which adds to the complexity of identifying patients with a particular condition. Case definitions for eight chronic and neurological diseases were created by CPCSSN using a variety of text words, ICD-9 codes and disease-specific criteria such as medications or laboratory results.5,6 The eight conditions are hypertension, diabetes, osteoarthritis, depression, chronic obstructive pulmonary disease, dementia, epilepsy and parkinsonism. The case definitions were applied to the CPCSSN database and validated by networks across the country using the original patient chart as the gold standard; most definitions demonstrated very good sensitivity and specificity.5,6 Definitions for additional conditions are under development.
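As a rough illustration of how a case definition might be checked against chart review, the sketch below computes sensitivity and specificity from paired per-patient flags. The function name and the sample data are hypothetical and are not drawn from CPCSSN or its published validation studies.

```python
def sensitivity_specificity(algorithm_flags, chart_flags):
    """Compare case-definition output with chart review (the gold standard).

    Both arguments are equal-length sequences of booleans, one per patient.
    Returns (sensitivity, specificity).
    """
    if len(algorithm_flags) != len(chart_flags):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(algorithm_flags, chart_flags))
    tp = sum(1 for a, c in pairs if a and c)          # flagged, chart-confirmed
    fn = sum(1 for a, c in pairs if not a and c)      # missed true cases
    tn = sum(1 for a, c in pairs if not a and not c)  # correctly not flagged
    fp = sum(1 for a, c in pairs if a and not c)      # flagged, not confirmed
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true case and one true non-case")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example: 4 chart-confirmed cases and 6 non-cases.
algorithm = [True, True, True, False, False, False, False, False, False, True]
chart = [True, True, True, True, False, False, False, False, False, False]
sens, spec = sensitivity_specificity(algorithm, chart)
```

In this invented sample the definition finds three of four true cases and wrongly flags one of six non-cases; a real validation would also report confidence intervals and predictive values.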
Each PC-PBRN has received research ethics approval for the CPCSSN project at their respective host universities. Any studies external to current CPCSSN approvals require separate research ethics approval from the researcher’s home university and an Institutional Data Sharing Agreement with Queen’s University.
Data security & confidentiality
CPCSSN has taken numerous measures to ensure the highest data security and adherence to privacy policies.2 All data are transferred securely using an encrypted Virtual Private Network (VPN) connection. Both regional and national CPCSSN data are stored in the Centre for Advanced Computing (CAC) at Queen’s University, a high-quality computing facility with multiple layers of physical and digital security.4
Each contributing PC-PBRN has conducted at least one Privacy Impact Assessment, involving extensive documentation of all project-related processes with an assigned risk score for each privacy requirement, the development of risk mitigation procedures, and a review of provincial health legislation and university-specific security policies as applicable to the project.
Ensuring patient and provider confidentiality is a core value of the CPCSSN network. The names and personal information of participating sentinels are never released without their explicit consent. Re-identification of patients cannot occur outside the clinic environment or without the consent of the sentinels and separate approval from the research ethics board, giving sentinels full control over their patients’ sensitive information.
Data resource area & population coverage
As of May 2016, there were nearly 1200 sentinels participating in CPCSSN from over 200 practice sites.7 Clinical and demographic information for more than 1.5 million patients is contained within the database, with approximately 700 000 patients recording at least one clinic visit in the previous 12 months. PC-PBRNs recruit practices from most provinces and territories across Canada, including British Columbia, Alberta, Manitoba, Ontario, Quebec, Newfoundland and Labrador, Nova Scotia and the Northwest Territories (Figure 1). CPCSSN continues to recruit sentinels and additional PC-PBRNs.
As a data source for surveillance and research, much consideration is given to ensuring that contributing patients, sentinels and clinics are representative of their respective Canadian base populations. As expected in a primary care sample,8,9 older adults and females are over-represented in the CPCSSN database as compared with the national population.10 As such, it is important to consider age and sex standardization and/or adjustment for all surveillance and research studies employing these data. It is also likely that patients in the CPCSSN database have higher socioeconomic status than that of the general Canadian population.11 However, to date, the CPCSSN database does not contain the necessary information to allow users to properly adjust for systematic differences in socioeconomic status.
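Direct age and sex standardization, one common way of making such an adjustment, can be sketched as below. The sketch assumes stratum-specific rates from the sample and counts from a reference population; every number is invented for illustration and none is a CPCSSN or census figure.

```python
def directly_standardized_rate(stratum_rates, reference_counts):
    """Weight each stratum-specific rate by the reference population's share."""
    if set(stratum_rates) != set(reference_counts):
        raise ValueError("strata must match")
    total = sum(reference_counts.values())
    return sum(
        stratum_rates[s] * reference_counts[s] / total for s in stratum_rates
    )


# Hypothetical prevalence by (age band, sex) stratum in a primary care sample.
rates = {
    ("18-64", "F"): 0.10,
    ("18-64", "M"): 0.12,
    ("65+", "F"): 0.30,
    ("65+", "M"): 0.34,
}
# Hypothetical reference population counts for the same strata.
reference = {
    ("18-64", "F"): 400,
    ("18-64", "M"): 400,
    ("65+", "F"): 110,
    ("65+", "M"): 90,
}
standardized = directly_standardized_rate(rates, reference)
```

Because the weights come from the reference population rather than the sample, an over-representation of older adults or females in the sample no longer inflates the summary rate.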
Participating physician and clinic characteristics were evaluated against the 2013 National Physician Survey, which collects information about location of practice (urban or rural), type of practice (academic or community-based), and age and sex of the provider.12 Sentinel physicians contributing to CPCSSN tend to be more often female and younger than physicians in general in Canada, though provincial variability is evident.10
Depending on when contributing clinics introduced their EMR system into their practice or when historical data were entered into the patient chart, CPCSSN data start at various time points, with some records going back to the early 1990s. Data from 2008 onwards are considered acceptably robust for analysis, as the increased uptake and more complete use of EMR systems over time has contributed to better data quality and volume in the later years.
Patient-level information is collected from almost the entire medical record, including non-identifiable demographics, current and historical diagnoses, medications, physical measures (such as blood pressure, height and weight), laboratory results, referrals, medical procedures, risk factors and physician billing submissions (if available). Data elements currently being captured are summarized in Box 1. At present, CPCSSN does not extract scanned documents because the data within these are not easily extracted and often include identifiable text that can be difficult to redact. Physician notes are also excluded for similar reasons. Due to the exceptionally large volume of laboratory and physical examination data, CPCSSN processes only those tests and values that are related to the eight conditions for which a case definition has been validated (Box 1). Additional laboratory and examination values will be added based on resources and priorities within the network.
CPCSSN has developed extensive cleaning algorithms to address data quality issues that can impede the use of EMR data. Many data elements are assigned to a cleaned field (including both ICD-9 codes and text words) alongside the original entry. The cleaning process takes into consideration the many different ways of entering diagnoses into the EMR, including abbreviations and misspelled words, and maps them to a diagnosis category based on ICD-9 classification headings.
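The general idea of keeping a cleaned field alongside the original entry can be sketched as follows. This is a minimal illustration and not CPCSSN's algorithm: the lookup table, function names and matching rules are all assumptions, and a real cleaning table would be far larger and clinically reviewed.

```python
import re

# Hypothetical lookup table: normalized free-text variants -> ICD-9 category.
VARIANT_TO_ICD9 = {
    "htn": "401",
    "hypertension": "401",
    "hypertention": "401",  # misspelling
    "high blood pressure": "401",
    "dm": "250",
    "diabetes": "250",
    "diabetes mellitus": "250",
    "diabetis": "250",  # misspelling
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def clean_diagnosis(original):
    """Return the original entry alongside a cleaned code (None if unmapped)."""
    entry = normalize(original)
    if re.fullmatch(r"\d{3}( \d{1,2})?", entry):
        # Numeric entry such as "401.9": keep the three-digit category.
        cleaned = entry.split(" ")[0]
    else:
        cleaned = VARIANT_TO_ICD9.get(entry)
    return {"original": original, "cleaned": cleaned}
```

For example, `clean_diagnosis("HTN.")` and `clean_diagnosis("401.9")` both yield a cleaned value of `"401"`, while an unrecognized entry keeps its original text and receives `None`, so nothing is lost when the mapping fails.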
Practice and provider data
Providers and practices are distinguished in the database by a non-identifying study ID. A select number of provider and clinic characteristics are available and linked to each patient record. These variables include type of provider (family physician, nurse practitioner or paediatrician), sentinel year of birth, sex, Canadian or foreign medical education, year of completed medical training, whether the clinic is academic or community based and whether in a rural or urban setting (Box 1).
To meet the technical requirements for ensuring CPCSSN data repositories contain only anonymized information, CPCSSN employed a Research Privacy and Ethics Officer to review and evaluate a three-phase approach to de-identification. The first phase was the exclusion of all structured direct and indirect identifier data fields (i.e. name, address, provincial health number) in the source EMR data. This is conducted during the initial data extraction phase by either the PC-PBRN data manager or directly through the EMR vendor. Alternatively, the PC-PBRN data manager can enter into a confidentiality and security agreement with the clinic, which permits the data manager to access the most recent back-up file stored in an encrypted folder on the EMR server. Once the health data are stripped of any directly and indirectly identifying information from the structured data fields, only data from specific EMR fields are extracted.
The second phase of de-identification is the application of algorithms to replace patient identifiers that may appear in unstructured, free-text fields with random digits (for example, a phone number 316‐544‐8371 would become <tel#>). First and last names, as well as regular expression pattern matching for other identifiers, are suppressed and replaced by a series of X’s.
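Pattern-based suppression of this kind can be sketched as below. The phone pattern, the `redact` function and the idea of passing in a list of known names are assumptions made for illustration; the sketch substitutes a placeholder tag such as `<tel#>` for phone numbers and X's for names, and does not reproduce CPCSSN's actual rules.

```python
import re

# Matches numbers shaped like 316-544-8371. The separator class also accepts
# the Unicode hyphen (U+2010), a dot, or a space.
PHONE = re.compile(r"\b\d{3}[-\u2010. ]\d{3}[-\u2010. ]\d{4}\b")


def redact(text, known_names):
    """Replace phone numbers with <tel#> and known names with X's."""
    text = PHONE.sub("<tel#>", text)
    for name in known_names:
        if not name:
            continue  # an empty name would match everywhere
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        # Preserve the length of the match so the text layout is unchanged.
        text = pattern.sub(lambda m: "X" * len(m.group()), text)
    return text


note = "Called Jane Doe at 316-544-8371 re: follow-up."
redacted = redact(note, ["Jane", "Doe"])
# redacted == "Called XXXX XXX at <tel#> re: follow-up."
```

Matching on whole words limits accidental suppression inside longer words, although common words that are also names would still be masked.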
The third and final phase of data anonymization is the application of the PARAT tool13 to further reduce the statistical risk of re-identification by combinations of different fields. If the tool detects higher or unacceptable levels of potential re-identification risk, it suppresses fields within certain records or reduces the level of detail within certain fields, such as dropping one or more characters in a postal code or rounding the year of birth to 5-year bands. The PARAT tool is applied before the release of CPCSSN data for ethics board-approved research.
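The two generalizations mentioned above, shortening a postal code and banding the year of birth, can be sketched as follows. These helper functions are hypothetical and are not part of the PARAT tool, which decides when and how much to generalize from an estimate of re-identification risk that this sketch does not model.

```python
def generalize_postal_code(postal_code, keep=3):
    """Keep only the first `keep` characters of a postal code."""
    compact = postal_code.replace(" ", "").upper()
    return compact[:keep]


def year_of_birth_band(year, width=5):
    """Map a year of birth to the band that contains it, e.g. 1960-1964."""
    start = year - (year % width)
    return f"{start}-{start + width - 1}"


# Example values chosen for illustration only.
area = generalize_postal_code("K7L 3N6")  # "K7L"
band = year_of_birth_band(1963)           # "1960-1964"
```

Coarser values place each record in a larger group of indistinguishable patients, at the cost of some analytic detail.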
Data resource use
CPCSSN provides a unique data source not currently available elsewhere in Canada. The national CPCSSN data have been used to answer a variety of relevant primary care research questions, with over 45 publications and 250 conference posters and presentations to date. Most notably, along with the publication of the validated CPCSSN case definitions,6 we have explored the epidemiology of these conditions in Canadian primary care settings.14–19 Additionally, CPCSSN has an emerging role in pharmacovigilance, where the data can be used to monitor adverse reactions in the post-marketing surveillance of medications prescribed in primary care in Canada.20
CPCSSN data are used at regional and local levels to answer research questions derived from these contexts, as well as at a provider and clinic level for quality improvement purposes. For instance, a family physician in rural southern Alberta was able to examine patient outcomes related to a lifestyle intervention developed specifically for obese patients, which was completed without cost to the clinic and assisted the clinic in empirically evaluating their programme.21 A primary care team in Toronto, ON, used CPCSSN data to create registries of patients with chronic diseases to assist with their clinical management and monitoring patient outcomes.22
CPCSSN data have also formed the basis of multiple student research projects at the graduate and undergraduate levels, including several important methodological advancements to which students have contributed. One such project explored the opportunity for using CPCSSN data as a source of national healthy weight data.23 Whereas the data have some important limitations that need to be carefully considered, the CPCSSN database contains, for example, more body mass index (BMI) records than all the objective BMI measurements collected by Statistics Canada health surveys over the past 20 years.23
More recently, the Public Health Agency of Canada has funded CPCSSN to further develop, implement and evaluate the CPCSSN Data Presentation Tool (DPT) in primary care clinics and departments of public health across the country.24 The CPCSSN-DPT is a customized web-based graphical interface that provides users with ready access to clinic- or jurisdiction-specific CPCSSN data after it has undergone processing and cleaning. It is anticipated that the CPCSSN-DPT will facilitate the adoption of public health methods of surveillance by primary care practitioners to enhance the monitoring, prevention and management of chronic disease across Canada.
Linkage studies combining CPCSSN data with those from other sources of health, social or census data are a significant opportunity for research that has recently been explored in several provinces. Primary care EMR data from CPCSSN linked with administrative sources provide a powerful method for following patients throughout the primary-tertiary health care system, contributing significant insight into important topics such as high system use and predicting hospitalizations, and including social determinants when reporting on chronic diseases. This type of linkage research has taken place in local networks (for instance, CPCSSN data linked with census data for studying socioeconomic status and obesity25 and linkages with deprivation scores to examine the socioeconomic influences on diabetes health26) and more broadly within several provinces, such as collaborations with the Institute for Clinical Evaluative Sciences (ICES) in Toronto, ON, the Manitoba Centre for Health Policy (MCHP) in Winnipeg, MB, and the Newfoundland and Labrador Centre for Health Information (NLCHI) in St. John’s, NL. Other CPCSSN networks are closely following suit, with linkage activities beginning to take place in their respective provinces.
Researchers across the country are currently using the CPCSSN data to develop EMR-specific definitions for pelvic floor disorders in women, childhood asthma, speech disorders in the elderly, chronic kidney disease, chronic pain and heart failure. Plans to include menopause, inflammatory bowel disease, multiple sclerosis and injury from falls are under way. Researchers with an interest in a specific condition are welcome to present proposals for new case definition development and validation using the CPCSSN dataset, including acute and communicable diseases.
Strengths and weaknesses
A key strength of the CPCSSN database is the ability to follow patients over time and perform longitudinal, patient-level analyses using up-to-date clinical data. CPCSSN data are more comprehensive and fine-grained than traditional sources of Canadian primary care information, such as physician billing claims and other administrative datasets. Using CPCSSN data reduces subjective biases found in self-reported health surveys, since EMR data comprise physician-identified diagnoses, objective laboratory and examination results, and prescriptions issued to patients.
One distinctive feature of CPCSSN is the ongoing cleaning, coding and standardization of the data extracted from multiple EMR systems, which are often entered as free text at the clinic. Without these continuous cleaning processes, the data would not be useable for research and analysis.
Further, the national scope of the database is a major asset. In Canada, health is federally mandated and provincially administered, meaning that administrative data are housed within each province; this makes inter-provincial comparisons complex and time consuming. CPCSSN amalgamates data from different provinces into a single federated database in a privacy-sensitive way, allowing for an inexpensive and cohesive source of national primary care information.
The challenge of using EMR data for purposes other than clinical care is to transform them so that they are fit for secondary use (i.e. research and surveillance). CPCSSN’s cleaning and coding algorithms have converted unstructured, unformatted clinical information into searchable data items, though there are still limitations imposed by the overall data quality inherent in most EMRs. Large blocks of narrative text are especially difficult to clean and to parse for useable information. Behavioural risk factors, such as smoking, diet, alcohol use and exercise, are often documented in this way.
Missing data are not uncommon; for instance, fewer than 3% of patients in the CPCSSN database have their ethnicity recorded. Ultimately, CPCSSN is only able to extract data that are entered into the EMR in some reasonably useable manner. Therefore, the data available are limited to: (i) those patients who attend community primary care clinics; (ii) data that are entered into the record; and (iii) data that can be directly accessed and, if necessary, cleaned and coded.
Additionally, the health system in Canada is organized so that patients may seek care from multiple physicians and other providers of their choosing. Although new primary care models emphasize patient rostering, whether formal or informal, it may be the case that some patients exist in the CPCSSN database more than once. At present, CPCSSN is unable to differentiate duplicated patients within its anonymous database. Whereas this problem is believed to be small, it does present a limitation to the current structure of the CPCSSN database, especially as expansion of the patient sample continues. As well, it is difficult to monitor attrition from patients leaving the clinic, moving to a new city or province or dying, as this is not always known or recorded in the EMR.
Last, the non-random enrolment of providers may impart a selection bias, as sentinels are likely early adopters of EMR systems with an interest in research and quality improvement. The CPCSSN database excludes providers using paper records, though this is quickly becoming an obsolete practice.
Data resource access
Researchers interested in conducting primary care research using CPCSSN data are encouraged to visit the website at [www.cpcssn.ca] (English version) or [www.rcsssp.ca] (French version). A CPCSSN Data Product Package is available by request, which contains the data dictionary, the entity relationship diagram (ERD) for the CPCSSN database, a short presentation summarizing the data holdings and their potential uses, and a sample dataset of 200 anonymized and delinked patient records. Qualified researchers are able to submit a letter of intent online, summarizing the proposed research project using CPCSSN data. After an internal review by CPCSSN’s Surveillance and Research Sub-Committee, applicants are invited to submit a full protocol and provide their letter of research ethics approval. A secure transfer of the CPCSSN dataset is initiated after the appropriate documentation is complete.
The CPCSSN data are available to university-affiliated (or equivalent) health researchers on a cost recovery basis, and a discounted rate is available for students. CPCSSN can provide additional services, such as data manipulation or analysis, for a nominal fee.
CPCSSN also welcomes collaborative research; please visit the website for information about current research projects, publications and CPCSSN co-investigators across the country. Any additional queries can be sent directly to CPCSSN at [firstname.lastname@example.org].
Supplementary data are available at IJE online.
The Public Health Agency of Canada initially provided a substantial contribution agreement to begin the development of CPCSSN in 2005, and has continued to provide smaller project-specific funding. Since 2015, additional funding support has been received from Canada Health Infoway, Health Canada, several provincial health ministries, universities and the private sector.
Conflict of interest: None.