Widespread use of medical records for research, without consent, attracts little scrutiny compared to biospecimen research, where concerns about genomic privacy prompted recent federal proposals to mandate consent. This paper explores an important consequence of the proliferation of electronic health records (EHRs) in this permissive atmosphere: with the advent of clinical gene sequencing, EHR-based secondary research poses genetic privacy risks akin to those of biospecimen research, yet regulators still permit researchers to call gene sequence data ‘de-identified’, removing such data from the protection of the federal Privacy Rule and federal human subjects regulations. Medical centers and other providers seeking to offer genomic ‘personalized medicine’ now confront the problem of governing the secondary use of clinical genomic data as privacy risks escalate. We argue that regulators should no longer permit HIPAA-covered entities to treat dense genomic data as de-identified health information. Even with this step, the Privacy Rule would still permit disclosure of clinical genomic data for research, without consent, under a data use agreement, so we also urge that providers give patients specific notice before disclosing clinical genomic data for research, permitting (where possible) some degree of choice and control. To aid providers who offer clinical gene sequencing, we suggest both general approaches and specific actions to reconcile patients’ rights and interests with genomic research.
With the broad adoption of electronic medical record (EMR) systems, researchers can mine vast amounts of patient data, searching for the best predictors of health outcomes. Many of these predictors may lie in the genome, the encoded representation of each person's DNA. As gene sequencing continues to evolve from a complex, expensive research tool to a routine, affordable screening test, most of us are likely to have our DNA fully digitized, vastly expanding the large store of electronic health data already preserved in or linked to our EMRs. In parallel, genomic researchers will, increasingly, seek out EMRs as an inexpensive source of population-wide genome, health, and phenotype data, thus turning patients into the subjects of genomic research. This will often occur without the patients’ knowledge, let alone their consent, in a research climate where the privacy risks are routinely discounted and data security can be uncertain. The implications, both for research and for privacy, are profound, but the prospect has received little attention in the literature.1
The widespread re-use of health information in EMRs is already commonplace, but those records typically don't include detailed genomic information.2 The landscape is changing, however, as technical advances make sequencing and storing patient genomes increasingly affordable, and as providers and academic medical institutions—along with government, science, and industry—envision using genomic data to enable ‘precision medicine’.3 As more patients have genomic data linked to their medical records, absent a change in policy or practice, we will see the same non-consensual re-use of these data that is already allowed for other forms of health information.
Advocates of the status quo argue either that there is little real re-identification risk for genomic data (the ‘privacy through obscurity’ theory) or, in the alternative, that if the risk is real, the consequences are minor, because relative to other forms of health data, information about genetic variation is less stigmatizing, less valuable, and, therefore, less attractive to hackers and criminals.4 The net effect of these rationales is a privacy standard for DNA sequences much lower than what currently applies to data elements such as URLs, fingerprints, and zip codes—each enumerated as an identifier under the Privacy Rule and protected when linked to health information.
Moreover, even assuming arguendo that genome sequence data don't constitute particularly sensitive health information, it is becoming difficult to maintain that a gene sequence (or substantial subset thereof) is not an ‘identifier’ that places any associated health or demographic information at risk, when databases of identifiable sequence data are proliferating and researchers are exploring ways to sequence DNA rapidly for use as a biometric identifier.5
And, finally, at the heart of this issue lies an important ethical, and practical, question: Should the scientific and provider communities continue to disregard the accumulating evidence from repeated studies that patients expect to be told about, and to control, research uses of their genomic and health information?6
The prospect of eventual, widespread EMR-based genomic research under current privacy practices drove us to write this paper. The paper proceeds in five parts: setting out the problem, reviewing the current status of records-based biomedical research, noting other secondary uses of medical records, describing the conflict between individual rights and societal interests implicated in genomics-based research, and providing our recommendations for a balanced approach.
We acknowledge the vigorous debate over almost every aspect of the problem of genomic privacy: whether genomic data are identifiable, whether it is likely that anyone would try to re-identify a subject of genomic research, whether patients have an obligation to participate in such research regardless of personal preference. Our paper builds on the 2008 recommendations of the Personalized Health Care Work Group of the US Department of Health and Human Services (‘DHHS’) American Health Information Community, which advocated special protections for the research use of genomic data in EMRs, arguing that such data are exceptional relative to other sensitive information due to their uniqueness and potential for re-identification.7 Without engaging the debate over ‘genetic exceptionalism’, we maintain that it is still useful here to draw a line—even if it is in sand—and to insist that if patients have any genuine right to understand and influence the uses of any of their sensitive medical information, such a right must include their genomes. That all bright lines are imperfect does not mean no lines are useful.
Although we do not call for legal or regulatory changes, we question whether current federal health privacy law, properly interpreted, actually permits health care providers, whether clinicians or academics, to treat whole genome sequence data as ‘de-identified’ information subject to no ethical oversight or security precautions, especially when genomes are combined with health histories and demographic data. We recognize that pending amendments to the federal Common Rule might affect, and even further strengthen, our argument, especially if, as proposed, IRBs would no longer oversee much secondary research involving medical records (as discussed below in Section II.A.2). We do not discuss those proposed changes in detail. The Common Rule amendments have been pending for half a decade, since the Advance Notice of Proposed Rulemaking (ANPR) was published in July 2011, so we do not assume that relevant regulatory changes are imminent or that their final form is predictable.
We conclude by offering standards (rather than new regulations) for individual providers and provider institutions (eg academic medical centers, HMOs, and large medical practices) to follow in dealing with both patients and researchers interested in genomic data of those patients. In these standards, we propose a model point-of-care notice and disclosure form for EMR-based genomic research. We call for rigorous data security standards and data use agreements (DUAs) in all EMR genomic research, but note that DUAs are relatively toothless without the means to audit compliance and penalize non-compliance.8 We acknowledge the limitations of any model of permission or consent, recognizing that such models can't anticipate every legitimate use or disclosure occurring in connection with research. At the same time, we do not agree that, at least in American culture, there is popular support for the view that all patients have a legal or ethical obligation to become subjects of all secondary records research, however valuable the science. Finally, we consider how researchers might encourage patient participation by sharing more information about the research, more quickly, with the patients whose data they obtain.
The stakes are high and time is limited. There are compelling reasons why researchers want and need to combine EMRs with genomic data. Without new steps to promote disclosure and awareness, one day the public will discover that medical and genomic information it assumed was confidential is in fact used widely, and at some privacy risk, in research the subjects neither consented to nor even knew about. This discovery could become an ethical, practical, and political landmine—one that we can, and should, avoid.
A health care provider must protect any health information associated with identifiers such as dates of treatment, zip codes, and URLs, but that same provider may, under current federal law, give a patient's genome to anyone who asks for it. This is because the federal medical Privacy Rule, promulgated under HIPAA (the federal Health Insurance Portability and Accountability Act of 1996), includes dates and URLs among a list of 18 enumerated identifiers whose use and disclosure are regulated, but doesn't specify that DNA sequence data constitute an identifier.9 In a subsequent regulation implementing the federal Genetic Information Non-Discrimination Act (GINA), federal regulators amended the Privacy Rule, clarifying (in response to arguments to the contrary) that genetic information is considered health information under the Rule, but left open the question of when such information becomes identifiable absent links to other enumerated HIPAA identifiers.10
Currently, actual genomic datasets, whether obtained through gene sequencing, exome sequencing, or whole genome sequencing (WGS), typically are not linked to clinical medical records, although genomic test reports and summary data already appear in an ever-increasing number of EMRs. Consider, for example, the hundreds of thousands of BRCA1 and BRCA2 tests performed annually for clinical purposes (the results of which will appear in the medical record), as well as the burgeoning practices of sequencing children with mysterious illnesses (and their parents) in an attempt to determine whether a given condition is linked to a genomic mutation. Most often stored separately on research servers, the genomic data obtained for these purposes are likely to remain linked to patient identities and medical data, and to be preserved for future interrogation as researchers find new, disease-linked variations in the human genome. Notably, the National Human Genome Research Institute is funding a number of pilot projects to explore clinical sequencing in populations ranging from oncology and primary care patients to cardiac patients and those with intellectual disabilities.11
Aside from potential clinical uses, gene sequencing is common in research, where it often occurs without the specific consent of the persons whose DNA is sequenced. In fact, current medical research norms permit a scientist who has access to previously collected samples of a patient's blood or tissue to sequence that patient's genome without asking the patient to consent to sequencing. (At best, the patient whose clinical specimens are sequenced for research may have signed a clinical consent form containing an inconspicuous, somewhat vague disclosure that samples and data may be shared for unspecified future research.) The scientist then may, and in some cases (eg if a recipient of NIH funding for the sequencing) must, share the resulting genomic data with others, including sending the dataset for inclusion in federal government databases used by researchers and companies worldwide, usually without any additional notice to the patient.12
The main ethical and legal justification for this practice is the long-standing assertion that a genome constitutes ‘de-identified’ information, the disclosure of which poses no significant privacy risk.13 Yet quietly, but with increasing urgency, medical researchers are debating whether subjects of genomic research can reasonably expect to remain anonymous, as some new studies suggest future re-identification is increasingly possible, if not probable.
Meanwhile, the focus of genomic research is shifting from individuals to populations, from small laboratory collections of DNA to vast databases of genomic and health information, with corresponding privacy implications for increasing numbers of people. The Precision Medicine Initiative, announced with fanfare by President Obama in January 2015, is accelerating this shift. Chief among the databases of interest to researchers will be the burgeoning EMR systems maintained by the nation's health care providers—the physicians, hospitals, laboratories, and insurers who create and maintain health care data.
Technology is changing not only how researchers study DNA, but also how providers manage clinical data. Due in part to federal financial incentives, EMRs have now become the standard for US medicine, replacing the familiar paper chart.14 This is becoming true even for physician groups, which have lagged hospitals, laboratories, and insurers in adopting EMRs. In digital format, this immense, increasingly cross-institutional and networked collection of health information, from medical histories and patient demographics to treatment outcomes and laboratory test results, affords researchers new opportunities to amass and study large volumes of health outcomes data.
These trends in genomics and data storage are converging, as it becomes apparent that the data used by medical providers will eventually include rich genomic information.15 No less an expert than Dr. Francis Collins, the director of the National Institutes of Health, has expressed his anticipation that once storage in the EMR becomes possible, patients’ genomes can and should be sequenced and the data made available for clinical care and research.16
The coming collision: modern genomics and medical privacy
Advances in data science and information technology are eroding old assumptions—and undermining researchers’ promises—about the anonymity of DNA specimens and genetic data. The term ‘de-identification’ does not mean what the typical patient might expect: in fact, a ‘de-identified’ file with both genomic data and traditional medical data, including demographic information on the patient, increasingly can be ‘re-identified’, either by connecting the genomic data to a source with identified genomic data or by connecting the medical data to an individual.17 At best, the term ‘de-identified’ is a probabilistic statement about the perceived small likelihood of such re-identification.18
Databases of identified DNA sequences are proliferating in law enforcement, government agencies (eg the military, state health department newborn testing programs), genealogical databases, both commercial and public, and commercial direct-to-consumer genetic testing enterprises, continually increasing the likelihood that a de-identified gene sequence could be re-identified (linked to a specific individual) if obtained by a person or entity with access to such ‘reference’ databases.19 Substantial steps toward re-identification could be taken even by someone capable only of linking a file to identified genomic data of a first, second, or third degree relative of a data subject—relationships readily ascertainable from dense genomic information.
Beyond direct comparison to an identified DNA database, re-identification may also be possible when a third party defeats the de-identification measures used to protect the phenotype data (eg demographics and medical history) typically linked to the genomic data used in research. Current de-identification practices for phenotype data generally involve removing specific data fields, such as names, addresses, and zip codes, but are not a guarantee of anonymity. Rare combinations of health and demographic data may leave specific individuals within a de-identified data set at a not insignificant risk of re-identification.20 Ironically, this is particularly true among populations with a high incidence of a particular genetic disease for which research is needed.
And new re-identification risks will emerge as scientists learn to profile individuals using information encoded in the genome itself, such as height, ethnicity, hair color, and eye color. This future is not mere theory or science fiction: authors of a 2014 study published in PLOS Genetics describe a method to use the genome and computerized rendering software to ‘computationally predict’ three-dimensional models of individual faces; the authors foresee widespread use of these techniques within a decade (See Fig. 1).21 Physical attributes such as height, whose phenotypic expression is influenced by the environment and by multiple genes, may never be genetically profiled with precision, and ‘gene photofitting’, by itself, may never yield an absolute identification. These techniques will, however, be able to eliminate vast numbers of possible sources for genomic information and, in combination with the de-identified medical information routinely shared for genomic studies, could elevate the re-identification probability for gene sequence data.22 Debates about re-identification often overlook this type of profiling risk, which is independent of the availability of any reference database.
Lastly, patients are, unwittingly, multiplying their own re-identification risk by transferring increasing amounts of their own identifiable health data to the web via Internet-based personal health records, genealogical tools, interactive medical devices, and even Google searches for disease sites and treatments. A typical de-identification scheme for health data never considers the cumulative identifiability of the health information an individual distributes across the Internet.
Today, medical ethicists, lawyers, and data scientists debate whether de-identification remains a reliable means of privacy protection. One camp maintains that the risks of re-identification are overstated, creating a climate that impedes research unnecessarily; another group of experts, the ‘re-identification scientists’, counter by demonstrating repeatedly how they can re-identify supposedly anonymous subjects in genomic research databases.23
Yet to date, this debate has been largely academic, concerned primarily with the privacy of subjects in discrete research studies. Gene sequencing technology is only now maturing into clinical use, and the number of persons whose genomes have been sequenced for research in the USA is relatively small compared to the total patient population. Though many of these research subjects contributed DNA before the advent of sequencing technology and are almost certainly unaware that their genomes have been sequenced and shared, most did consent to participate in some form of medical research and provided DNA samples for this purpose. In theory, at least, these subjects all knew they were assuming new privacy risks arising from research.
This is about to change, and the consequences merit careful consideration.
The impetus for change will be the movement of gene sequencing from the research laboratory to the clinic. When the day arrives that most patients’ genomes are sequenced routinely in the course of medical care, genomic data will be integrated in or linked to medical records.
The vehicle for change will be EMRs, which are rapidly replacing the traditional paper medical chart. EMRs that contain (or link to) gene sequence information will become a treasure trove for genomic research on a population-wide scale, allowing researchers to forgo recruiting DNA donors in favor of obtaining genomic data directly from the EMR.24 Current accepted practices for records-based research, including waiver of HIPAA authorization and ‘de-identification’, could, if extended to include EMR genomic information, result in both genomes and health data distributed to networks of researchers throughout the country and, in some cases, around the world—all without the knowledge or permission of the patients themselves.25 Calls to address privacy risk simply by penalizing re-identification attempts ignore the sad reality that data breaches, though illegal, are reported with increasing frequency for everything from financial records to political documents to health records, yet, while data custodians may be penalized, there are few reports of arrest, conviction, and punishment of the offenders who commit these breaches.
EMRs will transform records-based research
Electronic storage of clinical data is widespread: hospitals and health systems were early adopters of EMRs, and the National Center for Health Statistics reports that as of 2013, nearly 80 per cent of office-based physicians used some sort of electronic records system, many in response to multi-billion dollar federal incentive programs.26 Though designed primarily to improve health care delivery and facilitate reimbursement, EMRs, with their large volumes of readily transmissible patient data, are becoming equally essential to medical research. Digital health data are so easily exported from clinical records that in a single project, a researcher using EMRs can study the health outcomes of thousands or even (in large health systems) millions of patients. By pooling data from the EMRs of multiple provider institutions, researchers have also begun to follow health trends and examine health outcomes in entire populations.
Virtually every American who receives health care has—or soon will have—an EMR combining health information with demographic data such as height, weight, birth date, and address. Already, an estimated 40 per cent of the American population has medical record information stored in an EMR manufactured by a single company, Epic, a leading supplier of EMRs to academic medical centers and large health systems.27
The utility of a common electronic platform for data-driven patient care is already apparent. Epic has created an electronic health information exchange (HIE) among more than 200 institutions. Over a million records per month are shared across this exchange for patient care purposes, but this extensive network has also enabled novel research: a 2014 study pooled emergency department records across four Epic institutions and found that use of the Epic HIE avoided more than 560 duplicate diagnostic procedures during the 9-month study period.28
In short, EMRs permit research on a scale—and with a degree of predictive power—that was inconceivable in a world of paper medical charts. Because EMR-based research is so feasible and so potentially powerful, most patients in large health systems are also becoming research subjects. The only apparent rate-limiting factors are persistent interoperability problems, particularly across platforms, and the variable quality of EMR data, which tends to be worst during the initial years of transition from paper-based systems.29 Importantly, however, most EMR research happens outside the awareness of patients, under laws that facilitate the research use of health data.30
WGS will become the clinical standard of care
Paralleling the expansion of EMR systems in medicine, a technological revolution in genomics has increased the speed and, to a remarkable degree, reduced the cost of decoding, or sequencing, an entire human genome. While at least a half billion dollars were spent to sequence the first human genome a decade ago, for a few thousand dollars it is now possible to sequence any patient's DNA and preserve all the sequence data for future use.
Today, at the request of a treating physician, a laboratory might sequence a single patient's genome to detect information relevant to that patient: namely, a small but growing number of genetic variants known to signal disease susceptibility or predict medication response. WGS is not yet common in medical practice because analytic and reporting techniques vary, and because for any given disease, insurers remain uncertain whether WGS is a medically necessary diagnostic service that merits reimbursement.31 Studies also suggest that it is premature to use WGS to screen healthy adults because the reliability and clinical validity of many findings remains unclear.32
But these are short-term obstacles; consensus opinion holds that in the future, clinical demand for WGS will only increase. Similarly, other forms of genomic testing, such as whole exome sequencing (sequencing of the highly identifiable, protein-coding regions of the genome) or sequencing of particular panels of genes or other significant genomic regions, may gain popularity as a more cost-effective alternative. In the next two decades, it is quite possible that some kind of genome sequencing will become standard clinical practice for newborn babies.33
To meet this demand, EMR vendors will be driven to solve what are, for the moment, daunting challenges: how to store very large gene sequence files (or allow the EMR to interrogate the databases where these data are stored); how to display genetic test results in standard format; how to create decision support tools to make the results meaningful to clinicians who are not genetic counselors.34 To facilitate insurance coverage and claims processing, regulators, laboratories, and professional medical societies will eventually develop common standards for reporting sequence data and coding sequencing services.35
EMRs will become a compelling tool for genomic research
While clinicians can use gene sequencing to diagnose known genetic conditions and predispositions, researchers are using this technology to identify new genetic factors in disease. Scientists combine WGS data (and related data types, such as whole exome sequences) with demographic and health data to hunt for new genetic markers that correlate with health conditions. This research technique is one example of what is known as a genome-wide association study (GWAS).36 Earlier GWAS efforts used data from inexpensive array technologies to study markers in the genome called single nucleotide polymorphisms (SNPs). SNP-based analysis almost always provided, at best, disappointingly weak associations between particular SNPs and diseases or traits. GWAS using WGS data should be much more powerful.
GWAS requires big data: GWAS researchers often assemble databases containing not only genomes, but information culled from the medical histories of thousands of patients. Such databases are expensive and time consuming to create in the traditional research model, where each DNA donor is recruited and consented as a study participant, each DNA sample is sequenced using research funds, and the relevant medical information must be extracted from each donor's medical chart.
Within the next decade, however, as gene sequencing becomes more common in clinical medicine, it is likely that the data necessary for more powerful, sequence-based GWAS will already exist in (or be linked to) EMR systems. When insurers begin to pay for sequencing in the course of routine care, this trend will accelerate.
As this happens, the totality of the sensitive information embedded in the genome—information about risk of future diseases or addictions, traits and susceptibilities shared with relatives and children, actual biological relationships and ancestral origins, and an unknown quantity of information, yet to be discovered, about the relationship between genes and health—will become an enduring part of EMRs.37 This does not mean that everyone will have profoundly important or sensitive information in his or her genome, let alone a personal ‘future diary’.38 Still, a significant number of people will—and few if any will know in advance whether they are among those with such sensitive genomic information. This proliferation of clinical genomic data will occur just as the use of EMRs for research becomes commonplace, under norms that don't require patient consent.
Of course, to date GWAS has not been an unvarnished success, and as noted previously, the validity of EMR data can be variable.39 Nonetheless, in the long run financial incentives strongly favor EMR-based genomic research, as scientists who make secondary use of clinical genomic data bear neither the cost of gene sequencing nor the effort and expense of consenting individual patients and collecting project-specific phenotype data.40
De-identification is a moving target
For decades, medical ethicists have approved and regulators have allowed the non-consensual use of clinical records in research on the basis of one core assumption: that removing common identifiers such as names and Social Security numbers from the data nearly eliminates the risk of harm. This approach, once quaintly termed ‘anonymization’, is currently known as ‘de-identification’ (reflecting a growing understanding of the probabilistic nature of re-identification).41 When data are de-identified, anonymity isn't, technically speaking, guaranteed: instead, identifiers are removed or masked to the point where the probability of re-identification appears at a given point in time to be (as specified in one federal regulation) very small.42
Yet, even if de-identification can protect many forms of health data by reducing the probability of re-identification, genomic data in their raw (non-transformed) format—or as a list of variants from a standard (reference) genome—may be unusually vulnerable to future changes in the level of re-identification risk. Unlike a blood type or a cholesterol test result, an individual's DNA sequence codes for unique combinations of physical traits that, collectively, may create a fully or partially identifying profile.43 The more scientists learn about genetic profiling, the more this profiling re-identification risk will escalate. Meanwhile, the more commonly discussed possibility of re-identification via comparison of anonymous sequences with identified DNA databases in the public and private sector will also remain a growing risk.44 In either case, to the extent genomic data are linked with ‘de-identified’ phenotype data, re-identifying a gene sequence will also mean re-identifying all of the EHR health and medical data associated with that sequence.
Skeptics might discount re-identification risk by arguing that no one would have much incentive to re-identify genomic information when other information stores, such as banking information, offer more low-hanging fruit. Apart from law enforcement and national security interests in genomic re-identification and profiling, however, one could easily foresee other motivations for genomic re-identification, from tabloid appetites for celebrities’ medical information to sophisticated targeted marketing efforts, as well as profiling for life insurance and other purchases (unlike for health insurance, genomic profiling for life insurance or credit risk is not prohibited by federal law). The return on re-identification efforts will likely increase as technology improves and medicine can tell us more about the implications of genomic variation.
The obvious solution might seem to be technical innovation that makes genomes less identifiable. Although data scientists now proffer a variety of algorithms that purport to transform genomic data into less-identifiable forms, the genomic research community has not embraced these techniques or adopted any standard for data transformation. It is possible that the transformations necessary to reduce the re-identification risk degrade the informational value of a genomic sequence to an unacceptable degree; more likely, scientists may want to preserve their access to raw, untransformed sequence data for future use.45 In either case, technology has yet to provide an attractive solution to the re-identification problem.
De-identification and its limits are more significant for records research as clinical data become electronic. The significant time and effort required to abstract data from paper medical charts manually have always constrained the size of research databases, limiting the aggregate privacy risk to patients. Electronic health data change this risk calculus in important ways: a typical EMR in a large health system contains tens of millions of records, and the effort required to export records is the same regardless of the number of records. By pooling data in multi-institutional studies and drawing upon multi-state electronic health information exchange (HIE) systems, it is foreseeable that researchers might one day access the health data and the genomes of the majority of Americans on a continual basis. Indeed, this is the model envisioned by some policymakers and embodied in the concept, endorsed by the National Academies’ Institute of Medicine, of a ‘learning health system’.46
And why not, if privacy is protected? The conventional view, reflected in a 2012 report of the President's Commission for the Study of Bioethical Issues, is that the benefits of new knowledge substantially outweigh the privacy risks of genomic research, provided that researchers remove direct identifiers (eg names and addresses) from the data. The President's Commission analogized DNA to a fingerprint that does not encode identifying information and may only be identified if matched to a print from a known individual.47
This characterization is insufficiently forward looking: it neglects the rapid growth in the number of public and private reference databases of information that could be used to make a re-identifying match, whether those databases are genotypic, medical, or genealogical. It also fails to account for the re-identification risk stemming from the future prospect of genomic profiling, the compilation of an identifying list of physical features using only information encoded in the sequence data.48 And, most fundamentally, it assumes that individuals’ interests and rights in the use of personal information are disposable if some third party concludes overall benefits outweigh overall risks. Individual rights generally don’t work that way.
The research community still maintains the perplexingly naive attitude that most data research, including genomic data research, should be considered ‘minimal risk’. Compelled by government mandates, research institutions spend millions of dollars each year on compliance systems to reduce the statistically rare incidence of physical harm to research participants in clinical trials. Yet the same institutions often participate in large-scale secondary data use projects where hundreds of thousands or even millions of patient records are exported to third parties, sometimes with little effort, apart from a DUA, to ensure that data storage and access procedures meet security best practices. Recent massive commercial and government data breaches—and in particular, breaches of large health systems (the majority of which have now been compromised in some way)—demonstrate that few data systems are invulnerable, so it seems realistic to assume that breaches of large research databases are inevitable.49 When this happens, the unaware participants may face real privacy and identity theft risks (medical identity theft is one of the fastest growing, and most expensive consequences of health care data breaches, imposing significant costs and burdens on patients and providers), and institutions themselves may be exposed to the very significant cost of providing credit monitoring, in addition to regulatory penalties and legal liability.50
Even perfect de-identification would not be enough
But assume, for the moment, that perfect de-identification—in essence, the elimination of re-identification risk—were possible. Would a reasonable patient still have grounds to object to use of her health data and genome for research? Some commentators argue that patients would, and the available data seem to support this view.51 Patients generally expect to exercise control over research uses of their information, and subgroups may actually object to certain uses.52 Whatever researchers, lawyers, and ethicists think of patients’ rights, to the extent that patients think they have such control, disregarding their understanding is unwise.
If, for example, data from members of one ethnic group were used, without the members’ knowledge or consent, in an effort to demonstrate that group's inferiority or predisposition to stigmatizing diseases or conditions, it seems both reasonable and, indeed, predictable that those members might object, as they have in several cases involving biospecimens.53 Causing distress in patients who learn only after the fact that they've become research subjects seems an ethical breach; it also seems likely to result in bad public relations and contentious politics for genomic science.
Patients are not (automatically) research subjects
The ‘patient’ who passively places her health in the hands of a well-intentioned physician is a concept dating to antiquity. The ‘human subject’ who makes an informed affirmative choice to subjugate her own interests to those of science is a relatively modern construct. Not until the mid-twentieth century did organized bodies begin to define different ethical norms for medical care and human research, reflecting a growing understanding that research alters the physician-patient relationship (although the distinction between research and treatment can be blurred in areas such as oncology, where many patients are placed on protocols as a means to access investigational drugs).
The World Medical Association's Declaration of Helsinki, published in 1964, along with its predecessor, the Nuremberg Code (of 1947), changed the landscape of medical research profoundly, eventually informing new legal protections for human subjects in many countries, including the USA.54 The Code is a widely cited appendix to the US military court's judgment in criminal trials of those responsible for horrific Nazi human experimentation; 17 years later, the Declaration expanded the Code's principles, making more explicit the obligations of physicians who conduct human research.
Both the Code and the Declaration assume that research introduces new risks and conflicts of interest to the physician–patient relationship; beyond informed consent, both documents also establish criteria for the research itself, such as societal value and risk minimization.55 But the Code addresses human experimentation, not data privacy, while the Declaration, even in its most recent, seventh revision in 2013, mentions data research only in passing, concerning itself little with the circumstances under which patient records might become research data. (It does, however, require that ‘[f]or medical research using identifiable human material or data, such as research on material or data contained in biobanks or similar repositories, physicians must seek informed consent for its collection, storage and/or reuse’.56) The Declaration also assumes that data may be rendered ‘anonymous’—an assumption that seems dangerous in our modern era of population-based genomic research.57 Moreover, neither the Code nor the Declaration anticipates a world in which technology and big data make it possible to render every patient an involuntary subject of genomic research.
Medical research guidelines issued in the 1980s by the Council for International Organizations of Medical Sciences (CIOMS), in collaboration with the World Health Organization, further refined the ethical obligations of biomedical researchers.58 Although these guidelines reflect the same overly sanguine assumptions about the effectiveness of de-identification and anonymization, the CIOMS guidelines, as last revised in 2002, do distinguish sharply between patient and subject data, prescribing different standards for secondary research involving the records of consenting subjects and research involving the records of patients, where privacy expectations are greatest. The guidelines advise that when medical records will be disclosed for research without consent, providers should always notify patients, and should honor specific patient requests not to participate.59
As we discuss further below, US regulations pertaining to research and medical privacy also distinguish between patients and subjects, providing for IRB review, consent, and HIPAA authorization when researchers transform ‘patients’ into ‘subjects’ by using identifiable patient data for research. These regulatory schemes do permit waiver of patient consent and authorization when certain criteria are met, but arguably do not permit researchers to override the wishes of patients who express a desire to opt out of research use.60
Current practice affords less than full disclosure to data subjects
What do patients understand and believe about how clinical data are used and disclosed for research? Most probably don't have an informed opinion, because there is no legal requirement that patients be given specific information each time their providers disclose records for research—unless the patients themselves know enough to ask the right questions.61 The federal medical Privacy Rule does require providers to give patients a ‘Notice of Privacy Practices’ (NPP), but with respect to research, a provider can satisfy the regulation by simply stating in the NPP that the provider ‘may use and share your information for health research’, and then obtaining a waiver of the Privacy Rule's patient authorization requirement or using a DUA when disclosing data for specific projects.62
Even for the rare patient who actually reads the NPP in its entirety, the required disclosures are quite vague and non-specific, and fall far short of conveying any sense of the sheer number of people, including third parties, who will be given access to patient information for records research—much less disclosing anything about the research itself.63 The only way that a patient can learn which researchers are studying her medical records is to ask the provider for an ‘accounting of disclosures’—and even such an accounting is limited in scope. Under the Privacy Rule, an accounting covers only the prior six years, is often not study specific, and includes only research involving ‘individually identifiable health information’ as defined by the Rule.64
This last limitation matters most, because a provider would not need to include disclosures of genomic data in a Privacy Rule accounting if such data are not considered identifiable health information. Typically researchers characterize genomic data as ‘de-identified’ information, and federal regulators have not objected. The research community has long operated as though a unique DNA sequence is not an identifier per se—unlike a fingerprint, driver's license number, or URL, each of which is an enumerated identifier under the Privacy Rule.65 Genomic data reside in an identifiability gray zone: while most researchers and policymakers have acknowledged that gene sequences could in theory be re-identified, linking the data with the DNA source, they have maintained that the magnitude of this risk is small, so small that it doesn't warrant requiring informed consent for data use or oversight by federally regulated Institutional Review Boards (IRBs).66
We disagree, and we argue, as have other commentators, most notably George Church, leader of the Personal Genome Project, that it is no longer ethically defensible or legally sound to maintain that gene sequence data are anything other than identifiable health information.67 For WGS data, the research community should dispense with the hair-splitting nuances of federal regulatory schemes that attempt gradations of ‘identifiability’ in favor of a best practice that recognizes that re-identification risk increases with time, and that patients are best protected if we treat their genomes as identifiers, now and in the future.
We are equally concerned about the transfer of gene sequence and medical data obtained in the course of clinical care to federal, commercial, and other third party academic medical center databases, without meaningful disclosure to the data subjects. We believe that at a minimum, patients and subjects should receive specific notice that this use of their genomes or other medical information can and does occur.
In the coming era of personalized genomics, we see patients’ privacy expectations colliding with the growing demand in academia and industry for genomic data, and with the ‘permission optional’ culture of medical records research. Patients, conditioned by both deep cultural beliefs about doctor–patient confidentiality and the more recent federal Health Insurance Portability and Accountability Act (HIPAA) paperwork to believe that medical privacy is their right and their provider's obligation, will be worried—even angered—to learn how extensively their genomic information is used and shared for research without consent, and how variable current data privacy and security practices in research can be.
Of course, patients can only object to the research practices of which they are aware. We think such awareness is inevitable, and that it may come about in one of two ways: either the research community launches a frank and open dialogue with the public, explaining the benefits of genomic research and proposing uniform standards to protect privacy interests, or the issue will surface in an inflammatory context such as a major security breach, prompting restrictive policies that neglect the immense value of the new knowledge emerging from this work.68
Precisely because the research is too valuable to jeopardize by risking a public backlash and ill-considered legislative or regulatory measures, we hope to spark that open dialogue by proposing standards and norms for the research use of clinical gene sequence data in the EMR.
RECORDS-BASED RESEARCH TODAY
This section of the paper looks first at the relevant current legal rules. The paper then examines the current ethical and legal practices for secondary records research, noting where current practices may diverge in spirit or effect from the stated intent of the ‘rules’.
For the purposes of this paper, three sets of current legal rules are important: medical record ownership, research subject protection, and health information privacy.
Who owns the medical record?
Patients would be surprised to learn that they don't own the medical records that their providers maintain; whether paper or electronic, these records are generally viewed as a business asset owned by the patient's provider (or that provider's employer).69 While federal and state medical privacy laws give patients certain rights of access to their providers’ medical records, these laws don't confer ownership of the records, or even full, traditional ‘privacy’ rights, because they don't allow patients to control how such records are created, used, or shared, except under narrow circumstances.70
Instead of a basis in true privacy or property rights, the ‘privacy’ regime in health care comprises a series of state and federal statutes and regulations offering what could more accurately be described as confidentiality protection: covered providers (and their vendors and contractors) are required by these laws to preserve patient confidentiality by maintaining medical records securely and disclosing identifiable information only for legitimate purposes, and subject to certain controls.71 The protection regime focuses on the provider's record: once information from this record is no longer under the control of the covered provider—for example, once it is in the hands of third party researchers—it is largely beyond the reach of most medical records privacy regulations.72
Legal protections for human subjects
In the USA, two federal regulations are the primary source of protection for human research subjects, but each regulation is limited in scope, and only one, the Federal Policy for the Protection of Human Subjects (‘Common Rule’), addresses the secondary use of clinical data.73 The Common Rule dates to 1991, prior to the significant use of EMRs, and extends only to research (a) funded or (b) conducted by the DHHS (which includes NIH-funded research) or by other federal agencies that have adopted the Rule by regulation, or (c) conducted by federally funded entities that elect to extend the Common Rule to all of their research. (Eighteen federal agencies follow the Common Rule.)74 Common Rule agencies require institutions receiving federal grant money (eg research universities) to file a Federalwide Assurance certifying that the grantee complies with federal human subjects protection policies; grantees whose assurance is suspended or revoked for non-compliance may no longer spend federal grant funds.75
The Common Rule generally requires, among other safeguards, that grantees obtain IRB review and seek participants’ informed consent when research will require an intervention with a subject or, importantly for this article, will involve the investigator obtaining what the Rule defines as ‘identifiable private information’ about living individuals, unless the research qualifies for one of several categories of exemption.76 The Common Rule exempts data research from these protections if the investigator otherwise has legitimate access to the data (eg is a physician studying her own patients), and will not record identifiers for the research.77 The regulation defines private information as ‘identifiable’ if an investigator may ‘readily ascertain’ the identities of the data subjects.78
Most relevant to data research, the Common Rule permits an IRB to waive participants’ consent if the research risks are minimal, the waiver would not adversely affect subjects’ rights and welfare, and the research could not practicably be carried out without the waiver.79 When an IRB approves an EMR-based research study (which typically involves many patient records), that IRB will almost invariably waive subjects’ consent and authorization on the ground that they would be impracticably expensive and time consuming to obtain. (The version of the Common Rule adopted by the US Food and Drug Administration targets research involving FDA-regulated products and does not contemplate this kind of records-only research.)80
In July 2011, the DHHS issued an Advance Notice of Proposed Rulemaking (ANPRM), signaling an intent to make extensive changes to the federal Common Rule for the Protection of Human Subjects.81 The subsequent Notice of Proposed Rulemaking (NPRM), published on September 8, 2015, generated a flood of comments, with many academic medical institutions focusing on the practical implications of a proposed consent mandate for biospecimen research.82 The future, and eventual terms, of these proposed amendments remain uncertain, but one provision, if adopted, could have a significant, if largely unnoted, effect. Few commenters, whether from academic medical centers or elsewhere, paid attention to the proposal to exclude entirely from human subject protection regulations any research use of identifiable health information governed by the HIPAA Privacy Rule.83
With this one provision, barely referenced in the preamble to the Proposed Rule (which noted simply that the researcher and the provider must both be covered by the rule), DHHS would effectively deregulate and remove from IRB review almost all EMR-based research conducted by covered entities. Note that researchers who are not otherwise covered by the Privacy Rule (and would therefore remain subject to the Common Rule) could become ‘covered entities’ for the purpose of accessing a covered entity's EMR, simply by providing some service (such as abstracting data from the EMR for the researcher's own study) to the HIPAA-covered entity and signing a HIPAA ‘business associate agreement’ with that entity.84
The proposed exclusion from the Common Rule of secondary records research leads us to conclude that the NPRM, if it became a final rule, would not impose any significant regulations relevant to our topic. We recognize, however, that until the DHHS publishes a final rule we cannot be certain—for example, it could adopt the suggestion (acknowledged in the NPRM text but not proposed as a change to regulatory language) that gene sequence data be defined to be identifiable under both the Common Rule and HIPAA.
Legal protections for medical data privacy and security
The federal medical Privacy Rule, promulgated under the Health Insurance Portability and Accountability Act of 1996, restricts how ‘covered entities’ (eg most providers and insurers) may use and disclose ‘individually identifiable health information’ for research. In comparison to the Common Rule, the Privacy Rule might appear to broaden privacy protections in data research. The Privacy Rule extends beyond federal grantee institutions to all US entities that transmit health data electronically for a covered purpose (almost all providers and health care institutions, as well as insurers). The Privacy Rule also defines ‘identifiable’ more broadly than the Common Rule, which protects the subjects of private information only when an investigator can readily ascertain the identity of those data subjects. The Privacy Rule, by comparison, protects all health information held by a covered entity when there is a reasonable basis to believe such information can be used to identify an individual, even if not ‘readily’.85 Before a covered entity may use or disclose this protected health information (PHI) for research, the entity must obtain each data subject's written authorization.86
Importantly, however, the Privacy Rule contains its own waiver provisions, with criteria resembling those in the Common Rule, and in addition to waiver provides several other routes for a covered entity to use or disclose information to researchers without any form of patient permission. The first is a ‘de-identification’ regulatory safe harbor, under which the covered entity may treat health information as completely outside the scope of the Privacy Rule's protections if the entity removes 18 enumerated identifiers (ranging from name and zip code to URLs and biometric identifiers) and has no ‘actual knowledge’ that the remaining data could be re-identified; alternatively, the entity must obtain certification from a ‘statistical expert’ that for the combination of elements in a given data set, the probability of re-identification is ‘very low’.87 The covered entity also may elect to create a ‘limited data set’ by removing specific ‘direct identifiers’, such as name and Social Security Number, and may then disclose the remaining data to researchers under a ‘data use agreement’ that contains terms specified in the Privacy Rule.88
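The mechanics of the safe harbor can be sketched as a simple field filter. The record layout and field names below are invented for illustration, and the toy identifier set stands in for the rule's 18 enumerated categories:

```python
# Illustrative sketch of the Privacy Rule's 'safe harbor' as a field filter.
# The real rule enumerates 18 identifier categories; this toy set names
# only a few, using invented field names.

SAFE_HARBOR_IDENTIFIERS = {
    "name", "street_address", "zip_code", "phone", "email",
    "ssn", "mrn", "url", "ip_address", "biometric_id",
}

def safe_harbor_deidentify(record):
    """Drop every enumerated identifier field from a patient record.

    Under the actual rule, removal alone is not enough: the covered
    entity must also lack 'actual knowledge' that the remaining data
    could be re-identified, a condition no field filter can check.
    """
    return {k: v for k, v in record.items()
            if k not in SAFE_HARBOR_IDENTIFIERS}

record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "zip_code": "02139",
    "diagnosis": "type 2 diabetes",
    "genome_variants": ["rs1:AG", "rs2:CT"],
}

print(safe_harbor_deidentify(record))
```

Note what survives the filter: because the genome is not among the enumerated identifiers, the variant list passes through untouched, which is precisely the gap this paper argues the safe harbor leaves open.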
The HIPAA Security Rule, a companion regulation, applies to electronic PHI (ePHI), and requires covered entities to adopt administrative, physical, and technical safeguards to protect ePHI from unauthorized access and maintain the integrity and availability of ePHI.89 The Security Rule's standards govern how a covered entity stores and transmits any ePHI that entity maintains for any purpose, including research. Both the HIPAA Privacy and Security Rules apply in addition to any existing state laws pertaining to medical records privacy. Importantly, genomic information, if not deemed identifiable, need not be maintained in an electronic form that meets Security Rule standards.
Federal and some state regulations also require ‘breach notification’ in the event of certain data breaches, mandating that covered entities notify consumers and the government of large, unauthorized disclosures of identifiable personal information that have the potential to cause harm (eg disclosures of unencrypted data containing identifiers).90 The HIPAA breach notification regulations apply to defined breaches of all ePHI, but state breach laws typically define covered information more narrowly, limiting notification to breaches of information associated with a direct identifier, such as a name or Social Security Number. (State regulators may, however, have the power to impose substantial fines for certain data breaches.)91
Lastly, an evolving landscape of class action litigation has created liability-related incentives for hospitals and physician practices to maintain the privacy and security of clinical data. The litigation climate for security breaches is unsettled, with plaintiffs pursuing new theories of liability in the wake of large, highly publicized data breaches. While the elements of a successful claim are not yet clear, it is evident that large providers in states with more consumer-friendly breach statutes have begun to enter multi-million dollar settlements in class action cases.92
Each of these legal protections, whether for data research, data security, or data breach, is available only to data meeting various standards for identifiability. Unless genomic data receive this designation, unauthorized uses and disclosures of patient genomes will not incur legal penalties or civil liability.
As a general matter, a given element of personal data is protected only to the extent that a given law or rule defines the term ‘identifiable’ to include that element, but inconsistent legal definitions of identifiable—and inconsistent, sometimes equivocal guidance from federal agencies—cloud the status of genomic data. Moreover, perhaps due in part to the absence of any private right to sue under the Common Rule or the HIPAA Privacy Rule, there has been little if any judicial interpretation of ‘identifiable’ in these contexts.
The multiple meanings of ‘identifiable’
The federal Common Rule, drafted in the 1980s in an era of paper medical charts, deems information individually identifiable only if the identity of the subject may be readily ascertained by the investigator or associated with the information.93 Re-identification science and electronic data mining were not anticipated by regulators of the Common Rule era. Federal regulators have attempted in guidance documents to define the circumstances under which information is considered ‘Common Rule’ identifiable, but in so doing have highlighted the different standard for identifiability under the HIPAA Privacy Rule.94 In the preamble to the recent NPRM for the Common Rule, regulators considered but appear to have rejected the possibility of harmonizing these regulations by adopting the Privacy Rule standard for identifiable information.
The HIPAA Privacy Rule, written after the advent of electronic HIE and as a result of legislation expressly focusing on such records, extends the definition of identifiable to any health information where there is a reasonable basis to believe it can be used to identify an individual. Among the data elements that the Privacy Rule specifies as de facto identifiers are any ‘biometric identifier’ and any ‘unique, identifying number, characteristic, or code’.95 This more sweeping definition would seem, on its face, to include the genome, the ultimate biometric, a truly unique (but for identical twins) identifying characteristic and code. DHHS has not taken a position either way on that argument.
If there were any room for doubt, the HIPAA Privacy Rule, in its ‘safe harbor’ provisions, states that even after removing all 18 of the enumerated identifiers, a covered entity must treat the remaining data elements as protected information if that entity has ‘actual knowledge’ that a recipient could re-identify the information. What this means is at the heart of the genomic privacy debate in research.
The National Institutes of Health, most plainly in a recent policy document on genomic data sharing, consistently states that genomes are de-identified information.96 NIH continues to hold this position even though it has imposed increasingly stringent security precautions for access to the genomic data that it collects and maintains, now treating these data as potentially identifiable by requiring investigators who access the dbGaP genome repository to sign DUAs containing confidentiality, security, and access restrictions.97
The federal regulators who oversee Common Rule compliance for the DHHS at the Office for Human Research Protections (OHRP) have not challenged the scientific practice of assuming genomes are de-identified and conducting secondary genomic research without consent or IRB review. Thus far, OHRP has not publicly questioned the NIH position that whole genome sequence data are de-identified, and therefore their use does not constitute ‘human subjects research’ under the Common Rule. This is true even though the NIH reportedly has, in resisting compulsory disclosure of dbGaP data under the Freedom of Information Act (FOIA), argued that disclosure of such data would be an invasion of subjects’ personal privacy.98
The DHHS Office for Civil Rights (OCR), which interprets and enforces the HIPAA Privacy Rule, has been somewhat equivocal, if not cryptic, on this topic. OCR has stated in its guidance on de-identification that an ‘identifying characteristic or code’ is one that would currently allow for re-identification.99 With respect to the question of when a provider has ‘actual knowledge’ that data may be re-identified (thereby negating the safe harbor), OCR guidance states that the mere publication of re-identification techniques is not sufficient to meet this standard—leaving open the question of what kind of knowledge would suffice.100
DISTINGUISHING RESEARCH FROM OTHER SECONDARY USES OF MEDICAL RECORDS
Research is not the only—or even the most common—use of EMR data beyond the direct provision of care to patients. Healthcare providers routinely use and disclose information from their medical records, often in identifiable form, and without patient consent, for purposes that include a broad category of health care business activities such as billing, accounting, finance, strategic planning, and quality improvement (collectively termed ‘healthcare operations’ by federal privacy regulations);101 as well as for state and federal public health activities and to satisfy document demands from regulators, law enforcement, and litigants. We describe those other disclosures briefly, largely in order to distinguish the issues they raise from those involved in research.
The federal HIPAA Privacy Rule permits entities covered by the Rule (most health care providers, insurers, and pharmacies) to use or share identifiable patient information, without consent, to provide treatment. Under the rubric of ‘health care operations’, covered entities may also use identifiable patient information internally, without consent, as necessary to conduct normal business operations, such as to obtain payment, process claims, or assess the quality of care. And finally, the Privacy Rule also permits covered entities to disclose or share patient information, also without consent, with other covered entities for treatment or reimbursement purposes, and with vendors and contractors who sign an agreement (known as a HIPAA Business Associate Agreement) containing certain privacy and security obligations.102
One example that illustrates the scope of these ‘TPO’ (treatment, payment, and health care operations) disclosures is the electronic HIE. Through an HIE, providers across a state or region can network much of their clinical data for purposes that initially were treatment focused, but are now expanding to include research. HIEs may be public or private, though many were initially funded and facilitated by the federal HITECH Act. Some states have established HIEs on the state level, but third party operator entities have also moved to aggregate providers and health systems in multi-state HIE consortia.103 No specific notice to patients is required when a provider or facility participates in an HIE, although a minority of participants do seek prior patient consent. Most HIEs provide an opt-out for those patients who somehow learn of the HIE, object to data sharing, and contact the HIE operator directly.104
Beyond HIPAA, some state laws, as well as other, narrower federal laws, further restrict a provider's ability to use and disclose sensitive information, such as records of HIV or substance abuse treatment, without patient consent. To the extent they offer greater privacy protection, these laws are not preempted by HIPAA.105
Public health activities
State and federal government agencies, from the CDC to state health departments to the US Food and Drug Administration, also routinely collect identifiable patient information abstracted from medical records for what are termed ‘public health activities’. These uses range from tracking disease incidence to evaluating prevention programs or investigating adverse events related to drugs or medical devices. The Privacy Rule specifically permits these disclosures without patient consent, although, as mentioned above, patients may request that a provider or other covered entity provide an ‘accounting’ of instances in which that patient's identifiable health information was shared for certain purposes, including public health, during the past six years (though anecdotal evidence suggests few patients are aware of or exercise this right).106
Compliance and law enforcement
With certain procedural restrictions, the federal medical Privacy Rule also provides a pathway for providers and other covered entities to release identifiable information to federal agencies such as the Centers for Medicare and Medicaid Services (for Medicare-related billing, quality, audit, and other purposes); to other federal agencies performing audit or investigation functions; to federal, state, and local law enforcement; and to private litigants.107 This could include, for example, releasing information to the police as part of a criminal investigation or to counsel in a personal injury case who is seeking information to use against a party in that case. The Privacy Rule sometimes requires legal process, such as a subpoena, before a covered entity may make compliance and law enforcement disclosures.
Why research is different
Does research differ in any material way from these myriad other uses of medical records of which most patients are unaware, and over which patients may exercise little or no control? In many respects, the answer is no, but there are two important exceptions. The first caveat is that researchers who obtain information from patient records may operate outside the governance structures and the regulatory and contractual confidentiality obligations that apply to providers and insurers (and to their contractors), to federal agencies, and (to some extent) to state law enforcement and civil litigants.
While these legal requirements help to raise the bar for data security among operational and government users of EMR data, they aren't fail-safe; a recent report estimates that one in 10 US citizens has been affected by a breach of medical records security involving a provider or its contractors.108 But in the absence of mandated safeguards or even agreed-upon standards, data privacy and security in research turn on whether individual investigators understand and implement encryption, access controls, firewalls, and other basic electronic data safety measures. As a commentator noted in the journal Nature, the genomic information collected for research ‘is supposed to be highly protected [but] it is disseminated to various institutions that have inconsistent security and privacy standards … data protection often comes down to individual scientists … [o]nce leaked, these data would be virtually impossible to contain’.109 It is important to note that, as discussed above, even if adopted, the proposed changes to federal research regulations would likely not change this analysis. The proposal, which includes unspecified security standards, would not apply to most secondary research using EMR clinical data, because the revisions would largely remove the secondary use of HIPAA-covered data from the Common Rule.110
The second and more important caveat is that research is not something the patient is required, either legally or practically, to participate in. A patient must accept some uses and disclosures of information for a health care provider to operate a health care business or to respond to governmental demands. Each medical records research project conducted without consent, however, could be viewed as an elective intrusion upon patient privacy. Even though these intrusions may ultimately benefit this patient, or other patients, they are different from the trade a patient must make when giving up some privacy to access health care.
This elective aspect distinguishes research ethically from some of the other routine uses of medical records. The weight that we give this distinguishing factor may vary, but we can't disregard it. And, especially for research involving clinical genomic data, before we assume that patients will support this use unquestioningly, we must honor that ethical distinction, conveying the full scope of the privacy intrusion and explaining the limits of any assurances we make about confidentiality.
It could be argued that permitting the broad use of medical records for research should be a public duty, like giving evidence, undergoing mandatory vaccination, attending compulsory education, or paying taxes. Effectively, such legally authorized conscription of medical data for research already exists, to the extent that the federal Common Rule and Privacy Rule permit waiver of individual consent and authorization for medical records research without prior specific notice to patients. Although we, as coauthors, may disagree about the proper (and practical) scope of waiver in the research context, we both believe EMR research involving genomic data implicates privacy and security risks that exceed current norms.
ESSENTIAL RESEARCH VERSUS INDIVIDUAL RIGHTS
The potential for widespread EMR research using genomic data threatens a direct confrontation between the needs of research and the rights and interests of patients. In this section, we contend that this impending conflict requires special attention and possibly exceptional responses.
EMR research using genomic data is important
It is easy to see where the interests of all parties to EMR research align: Patients, providers, payors, and researchers all benefit when well-designed, ethically conducted studies produce useful new knowledge. From diabetes to cancer to infectious disease, much of what we are now learning about population disease risk and health outcomes—knowledge that currently improves care for millions of patients—results from researchers mining clinical data in EMRs.111 Adding genomes to this data mining effort creates a potent scientific tool that should lead to a better understanding of disease, and, ultimately, more effective, efficient treatments.112
Researchers themselves have a direct and substantial interest in maintaining their access to the immense volume of valuable clinical information stored in the EMRs of providers and health systems. Providers use the findings of EMR-based research to set practice standards and make evidence-based treatment decisions. Payors now use EMR-based research to decide whether treatments work and are cost-effective. Taxpayers, who subsidize federal payors such as Medicare, Medicaid, and the Veterans Administration, have a decided economic interest in supporting the kind of EMR research that creates a sound evidence base for reimbursement decisions.
EMR research using genomic data requires higher standards
But do these interests, in the aggregate, outweigh the individual patient's autonomy interests—interests that traditionally we attempt to honor in research? Some bioethicists have argued that they do, proposing that when medical records research involves minimal risks, everyone who is a patient has an ethical obligation to participate.113 Whether or not that is a compelling ethical argument, no express legal obligation currently exists (though, as noted previously, current laws permit the conscription of much patient information for research through a waiver process).114 But even if there were such an obligation, we can ask whether research involving clinical genomic information is different in ways that justify an exception to any such obligation.
The policy argument over genetic exceptionalism reflects conflicting views about whether genetic information differs in important ways from other clinical information, and deserves special protections. Some states have endorsed this view, singling out genetic testing in confidentiality statutes and non-discrimination statutes. In contrast, federal regulators rejected this approach when drafting the HIPAA Privacy and Security rules in 2000, refusing to declare genomes to be categorically different from other health information. This latter view is not uniform across federal legislation: the federal Genetic Information Nondiscrimination Act (GINA), though not a confidentiality statute, specifies that genetic information is a special category of health data that health insurers and employers may not use in coverage or hiring decisions.115
We think that among the many types of health information, several characteristics make genomic data especially, if not uniquely, sensitive. Like biometric identifiers, dense genomic datasets are unusually subject to re-identification; they can reveal sensitive family and ancestry information; and they predict current and future health concerns to an extent that is, at least currently, unclear in the aggregate and almost completely unknown to any individual. Perhaps most importantly, people believe genomic data are sensitive, and at least in some contexts (eg FOIA, as noted above), government entities appear to agree. By recognizing the dynamic, uncertain quality of re-identification risk and the near consensus that genomic data have some special sensitivity, we can address the tension between individual and collective interests by focusing, not on patient obligations, but on the obligations that should accompany the use and disclosure of clinical genomic data for research.
This does not mean, however, that we dismiss the privacy risks associated with the secondary use of other types of clinical data. For example, we think the widespread sharing of three-dimensional cranial MRI and CT datasets for research with few (if any) controls on data use poses a current and not insignificant risk to the privacy of the patients whose images are shared. Very little skill is required to use open source software to render a facial image from such a dataset (one could do this on a home computer); recent work suggests that with the help of facial recognition software, such renderings can be matched correctly to subjects’ photographs in nearly one third of comparisons.116 Though beyond the scope of this paper, the identifiability of imaging data is a research privacy problem that providers and imaging researchers should take seriously.
EMR genomic research may be a special case
Much medical records research will never be conducted with patient consent; many argue that for practical and scientific reasons, it can't be. In fact, regulators and commentators are entertaining proposals to eliminate consent for EMR research, or to simply deem such uses of EMR data ‘healthcare operations’ and therefore not research, thus removing them altogether from requirements for research oversight.117
But use of patients’ genomic sequences, which we believe to be identifiable within the plain meaning of that term, should be a special case.118 We do not accept claims that all secondary-use genomic research involves minimal risk to the data subjects, although existing practices in effect treat all of it as such. The heightened potential for re-identification of genomic data and the inherent sensitivity of such data are compelling reasons to distinguish whole-genome sequencing (WGS) data studies from other medical records research, and to afford patients’ autonomy and privacy interests greater respect than is the current practice.
We also argue that providers and researchers have an equally compelling, if less-often noted, interest in prioritizing patient autonomy and choice in especially sensitive areas of research. For economic reasons, providers must be concerned about meeting patient expectations and reducing liability exposure. Honesty and transparency about records disclosures should make good business sense—at least to the extent that such practices become industry norms.
Perhaps most pragmatically, genomics researchers will need continued public support, both for funding and for access to medical records. For EMR-based research, scientists’ access to data will depend on providers’ willingness to open their records; a change in public sentiment, prompted by revelations that genomes are disclosed to researchers without consent or IRB oversight, could affect that willingness dramatically. We have seen recent examples of popular backlash against unconsented and unknown research, from the Havasupai lawsuit over unexpected uses of health information and DNA samples given for diabetes research to lawsuits by parents in Texas and Minnesota over undisclosed research using their children's neonatal blood spots.119 And in response to such revelations, would researchers really argue to patients that, although their fingerprints and even their URLs are identifiers under federal law, their genomes are not?
V. FINDING A BALANCE: WORKABLE PRACTICES THAT RESPECT PATIENT RIGHTS
Legal mandates for privacy protection can usefully set a floor for conduct and enable the government to single out extraordinarily bad or negligent behavior for sanction. But as a means to establish best practices, laws and regulations have significant limitations: in genome-related research, just as in the financial services industry, the law—especially the protracted rule-making process of regulation—can never keep pace with innovation. Legislative responses can be backward looking and inflexible. Laws, regulations, and legal precedent are, for the most part, jurisdiction specific, while today's genomic research can involve international collaborations and multi-national corporations.
The better, more nimble, and more far-reaching approach is voluntary, but normative. Although the Privacy Rule (and proposed changes to the Common Rule) gives them the latitude to do otherwise, health care providers and the research community should adopt a common set of best practices to govern use and disclosure of genomic information created for clinical purposes. Professional societies, academic institutions, and major provider entities who publicly endorse consensus best practices have the power to create a de facto standard of conduct that evolves, flexibly and organically, with advances in science and technology.
There is precedent for such an approach in the embryonic stem cell research oversight (ESCRO) committee structure first proposed in 2005 by the National Research Council and the Institute of Medicine, with the goal of addressing emerging controversies in the largely unregulated area of human embryonic stem cell research.120 Many institutions have altered the NRC's procedural recommendations in favor of a more efficient review process, but the core proposals still garner praise as an example of successful scientific self-governance.121
Voluntary standards have also been proposed for international genomic database research: in 2009, the Organization for Economic Cooperation and Development, a member organization comprising 34 countries (including the United States), published guidelines to govern research involving biobanks and databases of genomic information.122 These standards are stated in broad terms, but include IRB (or, internationally, ethics committee) review of most secondary uses of genomic data, and would require data sharing agreements and specific protocols for data access and protection. Similarly, Knoppers et al., on behalf of three international genomics research organizations, have published a data sharing Code of Conduct for international research collaborations.123 Neither of these standard sets specifically addresses the secondary use of genomic data from medical records, although both guidelines recognize the need for policies that extend beyond the use of data collected in the context of a research protocol.
With these models in mind, and building upon this prior work, we offer the following proposed standards to govern research use of clinical genomic data.
First principle: avoid surprises
In 2006, the United Kingdom's Academy of Medical Sciences studied how British researchers use National Health Service medical records. The AMS concluded that the existing NHS goal—to seek patient consent whenever records could not be anonymized—‘will never be feasible for much research using patient data’.124
Arguing that anonymized data simply isn't useful for much research, and further, that British law allows researchers to use identifiable medical records without consent under defined circumstances, the AMS also endorsed what one commentator called the ‘no surprises’ principle: don't assume that the public understands and agrees; instead, reach out to inform patients and then study their attitudes and preferences, using what you learn to inform policy decisions.125
But asking the question means risking an unfavorable response. A recent UK study of patients in NHS outpatient clinics found that when asked, only 14 per cent supported use of their identifiable records for research, while 18 per cent would not permit research use even if their records were de-identified.126 In the USA, although studies suggest that patients do support the general concept of medical records research, when it comes to their own data, patients expect to be informed; many also want the opportunity to consent.
One of the best studies of US patient expectations, conducted in 2010, found that more than two thirds of patients who had donated DNA for genetic research did not want their genomic data shared with the federal dbGaP database without their express consent.127 This large survey involved elderly patients who had already joined a longitudinal, NIH-funded dementia study and had a lengthy relationship with the investigators; the authors note that a younger, more diverse sample of patients who have never participated in research might feel even more strongly about consent.
Quite possibly, despite glancing at the HIPAA notice of privacy practices (NPP) in their physician's office, few patients realize that their identifiable data, much less their genomes, could be disclosed outside their local clinic or hospital and used by researchers other than their own providers. On the basis of the few studies of attitudes and preferences conducted to date, however, it seems clear that patients do want to know. Whether one sees that fact as ethically important, practically important, or both, it clearly should be important.
So, in terms that they can understand, providers must tell patients that this is happening, and explain why. The fact that IRBs routinely waive patient consent (and HIPAA authorization requirements) for EMR research, on the grounds that seeking consent is impracticable for large samples, does not justify the failure of researchers and providers to give patients any meaningful notice of how (and how often) identifiable medical information—particularly genomic information—is used and disclosed for research.
Nor can we justify this failure to inform by resorting to the argument, advanced by some bioethicists, that patients have an ethical obligation to participate in medical records research—even if, under some circumstances, we agree.128 Such an obligation, even if it exists, would not be an obligation to participate blindly, with no awareness of the scope of the privacy risk or the scale of the potential benefits of the research. Unlike the HIPAA NPP, notice to patients about EHR genomic research should be informative.
Meaningful notice to patients could take many forms, but the simplest approach might be an electronic roster, maintained at the provider's website, of all studies to which the provider has disclosed genomic data, along with each investigator's contact information. Such a notice would contain information that patients are already entitled to receive under federal law, but which few actually do receive unless they are aware of and exercise their right to request a HIPAA ‘accounting of disclosures’ from each of their providers. An electronic roster of data studies might also help patients and institutions to hold investigators accountable for data security, by making public which third-party researchers are holding genome sequence data initially created for clinical purposes. The existence of the roster could be disclosed to patients in person or by email, mail, or telephone, in addition to being posted on the institution's website.
Provide more information, not less
We know that genome sequence data in the EMR will likely be used for research one day, even if it isn't possible to know by whom, or for which studies. What, then, should a physician who orders genomic sequencing for a diagnostic purpose tell his or her patient about this eventuality?
Patients deserve more than a generic statement that their medical information may be used for research. But the providers who are ordering gene sequence tests may have little or no information to share about particular studies involving EMR records. Increasingly, decisions about which data to export from the EMR and for what purpose are handled centrally within large medical centers and health systems, so providers in those environments may not even know when data about their patients are released to researchers.
What providers do know, and can tell patients at the point of care, is that research using patient records is now common and can be an important tool for discovering new relationships between genes and health. Physicians can tell patients that they share records because new research findings can improve the quality and cost-effectiveness of medical care. Providers must tell patients that it will not always be possible to ask for consent, but they can reassure patients that, through DUAs and other legal means, any researcher receiving genomic or other potentially identifiable information from their medical records will be obligated to protect the security and confidentiality of those data.
Yet providers should not promise absolute confidentiality. Patients should know that their genomes are unique and can't be made anonymous. Providers should also help patients understand that research findings developed through the use of EMR data may be too new and uncertain for medical use and therefore individual results will not, in most cases, be returned to patients.
Consider asking for permission and offering patients control
Consent is one of the biggest ethical challenges for EMR-based genomic research. Once the research community stops insisting that it is reasonable and ethical to treat genomic data as either ‘de-identified’, ‘anonymous’, or ‘not readily identifiable’, then in most cases federal regulations (and some state laws) dictate that researchers must obtain patients’ consent—or an IRB waiver of consent—before using these data for research.129
When medical records research involves identifiable information, IRBs often agree, consistent with regulatory criteria, that it would be impracticable to contact thousands of subjects to ask permission, and further, that the potential for response bias (differences between the health or demographic characteristics of those who consent and those who refuse) might compromise the validity of the study. The power of these justifications has made consent waiver routine in records research, to the point where millions of patients are currently the subjects of such research without having any idea that this is the case.
Innovations in the EMR space could disrupt the consent waiver paradigm by undermining the impracticability argument. One feature of many EMR systems is a ‘patient portal’, through which patient and provider may exchange information in a secure, encrypted communication. Portals are also a means for providers to push information to their patient population, and for patients to respond to satisfaction surveys or indicate preferences related to their care.
The patient portal could also be a way for patients to record their preferences about participating in genomic research. Because patient portals interface with the EMR, researchers using the EMR can identify those patients who have either given global consent or opted out of research participation, without the need to contact any patient directly. Several patient advocacy groups such as the Genetic Alliance and Autism Speaks are constructing a similar form of patient portal, with the goal of giving their members more control over the use of their samples and genomic information; conceivably such existing systems might be programmed to interface with Epic and the other major EMR systems.
We believe that providers who are using EMR systems should move toward allowing patients to document their willingness to participate in genomic research via a portal-based general permission form. We argue that this documented permission should not be equated to consent under federal research regulations, or to HIPAA authorization, because the point-of-care process will be too prospective and attenuated to meet these strict regulatory standards. There are also clear ethical limitations to seeking general consent for unspecified future research: most obviously, that patients can't make a fully informed decision about uses and risks not yet identified by investigators or IRBs.
And the challenges of implementing an EMR-based permission system are greater than they might first appear. When patients receive a preference form through an EMR patient portal, the burden is likely to fall on the primary care provider—the point of contact with the patient—to answer questions about risks and benefits of unspecified future research. The time constraints of the primary care setting dictate that any preference form be short and easy to read, so it is unlikely that the process or documents could meet the extensive consent requirements of federal research regulations. It may well make sense for institutions to set up alternative contacts for questions about this research permission.
Despite these limitations, and recognizing that the process may not meet all regulatory standards for research consent and HIPAA authorization, we still believe that asking permission for research use at the time of clinical testing demonstrates respect for patient rights and autonomy.
Importantly, however, an ethical permission process must inform patients that there are circumstances when permission cannot or will not be sought.
Be honest when permission isn't possible
Even if it were possible to give every patient the ability to log into a portal and record his or her preferences about research use of the EMR, there will still be instances in which patients’ clinical genomic data are used and shared for research without permission.
It simply isn't possible for a provider to apply patient preferences retroactively when patients’ DNA and data have already left the control of the provider's institution. Further, providers’ pathology departments and clinical laboratories still share ‘de-identified’ clinical specimens for research, especially in academic medicine, and not infrequently for gene sequencing studies, without any requirement to document which specimens were shared, or with whom.
We can advocate against this practice, and cite a 2012 proposal by the federal OHRP that all biospecimen research be, at a minimum, conducted in a traceable, secure manner (ie be registered with an IRB and subject to data security standards), but it will take time to change long-standing attitudes and expectations about the free exchange of ‘de-identified’ biospecimens. As a result, a patient who has ever had pathology or clinical laboratory testing can't be sure that her biological materials—or her medical information derived from them, including her genomic information—won't be used for research. Nor can her provider.
The provider might offer the patient choices and some degree of control over the use of EMR genomic data that the provider has not yet disclosed, but any permission form must explain the circumstances under which patient preferences will not or cannot be respected.130 For example, if the provider participates in an HIE whose laboratory test data, including gene sequencing data, may be used for research without patient consent, that provider should inform the patient of this possibility and of the availability of any opt-out.
Be scrupulous about data security
The research community's long practice of treating genomic information as de-identified or describing such data as ‘anonymized’ has impeded the development of community norms for data privacy and security in genomic research.
Providers, whether individuals or institutions, should only release EMR genomic data into secure environments. Release should be subject to a DUA between provider and recipient that contains enforceable indemnification provisions (supported by proof of insurance coverage) and is signed by a person with the authority to bind the researcher's employer entity. Terms of the DUA should include the following:
A minimum set of security standards that include encryption, storage only on secure servers behind institutional firewalls, and appropriate access and authentication protocols.
A designated list of approved recipients, with a prohibition on access by, or transfer to, unapproved third parties without the provider's written permission.
A requirement to provide the data source an annual accounting of all copies and all users of the data set.
A prohibition on attempts to re-identify or contact data subjects, or to create new, identifiable information through joinder with other available datasets.
Yet DUAs are not sufficient to truly protect privacy. DUAs provide no direct protection to data subjects, who are not parties to these agreements and whose information is already compromised in the event of any breach. The effectiveness of a DUA depends upon the data recipient's compliance; meaningful penalties for breach are difficult to enforce, especially in foreign jurisdictions.131 To achieve a more forward-looking security solution, federal policymakers should put a high priority on the development of secure, central data enclaves where researchers can access and analyze genomic data without creating and downloading new copies of the data.132 The dbGaP database, which currently distributes copies of the genomic data it warehouses, would seem the obvious starting point for such a project.
Build reciprocity: help patients see and share the benefits of EMR research
In many aspects of their lives, people give up privacy in exchange for something—often ease and convenience in using, for example, credit cards, websites that require cookies, or automated systems for paying road or bridge tolls. In medical research, with few exceptions, participation is premised on the promise of future societal, rather than individual, benefit. Telling those whose data are part of research about the concrete outcomes of that research—and doing it in language that they can understand—is one small but important way to try to ‘give back’ to those whose data were used in research, and perhaps to build support for research more broadly.
Yet genomic data provides more than just grist for a researcher's mill. Some things can be learned that individual patients or subjects might find valuable, or at least interesting. The question of ‘incidental findings’ has been a controversial one in the world of research ethics, but in some contexts, accurate and useful information about incidental findings can confer real benefits on research participants. People may also find interesting some general information about their genetic backgrounds. For example, ancestry information is not always benign, but, particularly at a relatively high level of abstraction, it usually will be. Similarly, some trait or even disease risk susceptibility information might, if sufficiently accurate, be interesting or even useful to research participants. Often, to preserve privacy, researchers agree not to attempt direct contact with subjects, and are not provided any contact information—making return of individual results impracticable, if not impossible. Even so, researchers should think hard about safe and useful ways in which, in summary form, the genomic information they are analyzing might somehow provide a nice little ‘thank you’ gift, a ‘lagniappe’, to the people whose data made their research possible.
The promise of big data and the appetite of researchers for access to information are enormous, to the point where, in pursuit of new knowledge, we've all but abandoned participant consent in records-based research, relying instead upon various degrees of de-identification to satisfy ethical concerns and meet regulatory requirements. There is very little in the way of transparency in most records-based research: apart from blanket reassurances in the HIPAA privacy notice that ‘your privacy is protected’, providers don't offer patients specifics about who will receive what information. Nor is any disclosure to patients likely to convey the uncertainty that lies beneath any categorical statements about privacy protection in research.
We can debate whether this preemption of individual choice is defensible. In the era of EMRs, the new knowledge obtained from population-scale, records-based research is immensely valuable; it may seem unfair to allow patients to benefit from these research findings without sharing in the privacy risk of the research itself. Proponents of choice preemption argue that where risk is minimal and benefits substantial, we should not allow dissenting patients to impose the response biases and process burdens that an opt-out would entail.133
We can even ask whether privacy and consent still matter, both as individual rights and as protections for human subjects, if de-identification strategies can effectively minimize the re-identification risk associated with a given set of EMR data. For many projects involving medical data this may be true. But when it comes to genomic research, scientists should not, and we believe ethically cannot, promise anonymity, label gene sequences as de-identified information, or fail to tell patients, in specific terms, who is studying their EMR genomic data and where copies of those data reside.
Unrealistic and even deceptive promises of anonymity are all too common throughout the online world, where website privacy policies promise that their corporate sponsors collect only ‘anonymous’ user data, even as these same sites track and aggregate browsing habits to form highly detailed profiles of the shopping, reading, and religious preferences of individual consumers.
But while ‘browser beware’ may be the norm in the commercial internet space, patients expect, and have the right to expect, that their medical providers—and the researchers who use patient data obtained from those providers—will hold themselves to a higher standard. For EMR-based genomic research, the starting point should be meaningful, specific notice to patients of all research uses and disclosures of genomic information. As technology permits, providers should also strive to offer patients some degree of control over discretionary uses such as research.
The standard ‘notice and choice’ model of privacy protection has limitations, especially in EMR research, where the future uses of patient data—and the future privacy risks—are unknown.134 Patients might give permission, but they can never provide fully informed consent at the point of care for all future genomic research. We should insist upon other protections for clinical genomic data used in research, such as data security measures that are no less rigorous than the standard for electronically maintained clinical information.
Given how rapidly the landscape of re-identification risk is evolving in genomic research, neither IRBs nor researchers can predict future risk with confidence. Geneticist George Church, who heads his own genomic sequencing project, argues that we should simply admit there is no reliable, enduring technical solution to privacy, and then work to convince DNA donors that the consequences of a research privacy breach are acceptable.135
We disagree. Researchers, providers, and regulators can—indeed must—do more than aim to convince patients to accept the privacy risks of EMR-based genomic research as an inescapable cost of receiving medical care. From an ethical standpoint, there is little meaningful difference between a research subject asked to contribute her blood specimen for gene sequencing—and afforded the right to say ‘no’, a right reaching back to the Nuremberg Code and other foundational statements of research ethics—and a patient whose genome is sequenced in the course of clinical care. Why isn't the patient entitled to know when her genome is shared with researchers? Why shouldn't she have a say in the matter?
Notice and a degree of control will produce one additional benefit: sunshine. If patients must be told which researchers, institutions, commercial entities, and federal research institutes receive their genomic data, patients can hold those recipients to account for data security, or even request that genomes be removed from research databases. This new scrutiny, though it might be uncomfortable at times, could actually prompt a greater level of patient engagement in genomic research. Researchers who can explain why EMR genomic research is valuable and how privacy is protected may find that patients, the ultimate beneficiaries, become vocal champions and enthusiastic participants. This paper is an effort both to point out the way the status quo impedes such a result and to describe a set of practices that are more likely to lead to it. We do not expect that we have said—or that anyone else will think we have said—the last word on this issue, but we hope we have opened, and moved forward, this crucial discussion.
This paper was funded by a grant from the Greenwall Foundation. The authors gratefully acknowledge the contributions of Debra Mathews, Ph.D., Michelle Meyer, J.D., Ph.D., and Mark Rothstein, J.D., each of whom commented on earlier drafts of the manuscript. The views expressed in the paper are our own and do not necessarily represent those of the reviewers or our employers.