Yao-Shun Chuang, Atiquer Rahman Sarkar, Yu-Chun Hsu, Noman Mohammed, Xiaoqian Jiang, Robust privacy amidst innovation with large language models through a critical assessment of the risks, Journal of the American Medical Informatics Association, 2025;, ocaf037, https://doi.org/10.1093/jamia/ocaf037
Abstract
This study evaluates the integration of electronic health records (EHRs) and natural language processing (NLP) with large language models (LLMs) to enhance healthcare data management and patient care. It focuses on using advanced language models to create secure, Health Insurance Portability and Accountability Act (HIPAA)-compliant synthetic patient notes for global biomedical research.
The study used de-identified and re-identified versions of the MIMIC III dataset with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic clinical notes. Text generation employed templates and keyword extraction to produce contextually relevant notes, with one-shot generation as a comparison baseline. Privacy was assessed by analyzing the occurrence and co-occurrence of protected health information (PHI), while utility was evaluated by training an ICD-9 coder on the synthetic notes. Text quality was measured with ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and cosine similarity, comparing each synthetic note against its source note.
The analysis of PHI occurrence and of text utility via the ICD-9 coding task showed that the keyword-based method combined low privacy risk with good downstream performance. One-shot generation exhibited the highest PHI exposure and co-occurrence, particularly in the geographic location and date categories. The Normalized One-shot method achieved the highest classification accuracy, and re-identified data consistently outperformed de-identified data.
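The PHI occurrence and co-occurrence analysis can be illustrated with a toy detector. The patterns below are hypothetical stand-ins for the two categories the study flags as most exposed in one-shot generation (dates and geographic locations); a real audit would use a dedicated de-identification tool rather than regexes, and the example note is invented.

```python
import re
from collections import Counter

# Illustrative surface patterns for two PHI categories; the tiny location
# gazetteer is a placeholder, not the study's detection method.
PHI_PATTERNS = {
    "date": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "location": r"\b(?:Boston|Houston|Springfield)\b",
}

def phi_counts(note: str) -> Counter:
    """Count pattern hits per PHI category in one synthetic note."""
    return Counter({cat: len(re.findall(pat, note)) for cat, pat in PHI_PATTERNS.items()})

note = "Seen on 03/14/2021 in Boston; follow-up 04/01/2021."
counts = phi_counts(note)
# Co-occurrence: a note leaking two categories together is riskier than
# either category alone, since combined identifiers narrow re-identification.
co_occurs = counts["date"] > 0 and counts["location"] > 0
print(counts["date"], counts["location"], co_occurs)  # → 2 1 True
```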
Privacy analysis revealed a critical balance between data utility and privacy protection, influencing future data use and sharing.
This study shows that keyword-based methods can create synthetic clinical notes that protect privacy while retaining data usability, potentially improving clinical data sharing. The use of dummy PHIs to counter privacy attacks may offer better utility and privacy than traditional de-identification.