Shun Liao, Jamie Kiros, Jiyang Chen, Zhaolei Zhang, Ting Chen, Improving domain adaptation in de-identification of electronic health records through self-training, Journal of the American Medical Informatics Association, Volume 28, Issue 10, October 2021, Pages 2093–2100, https://doi.org/10.1093/jamia/ocab128
Abstract
De-identification, the removal of protected health information entities, is a fundamental task in processing electronic health records. Deep learning models have proven to be promising tools for automating de-identification. However, when the target domain (where the model is applied) differs from the source domain (where the model is trained), the model often suffers a significant performance drop, commonly referred to as the domain adaptation issue. In de-identification, this drop can make a model unreliable in deployment. In this work, we aim to close the domain gap by leveraging unlabeled data from the target domain.
We introduce a self-training framework that addresses the domain adaptation issue by leveraging unlabeled data from the target domain. We validate its effectiveness on 4 standard de-identification datasets. In each experiment, we use a pair of datasets: labeled data from the source domain and unlabeled data from the target domain. We compare the proposed self-training framework with a supervised learning baseline that directly deploys the model trained on the source domain.
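The self-training setup described above can be illustrated with a minimal, generic sketch. It is not the implementation from the paper: the abstract does not specify the model or how pseudo-labels are selected, so the toy 1-D nearest-centroid classifier, the confidence threshold, the number of rounds, and all function names below are assumptions made purely for illustration. The sketch trains on labeled source data, pseudo-labels the unlabeled target data, keeps only confident predictions, and retrains.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Toy stand-in for a de-identification tagger (an assumption, not the paper's model):
# a 1-D nearest-centroid classifier mapping label -> centroid.
Model = Dict[int, float]


def fit(xs: Sequence[float], ys: Sequence[int]) -> Model:
    """Compute one centroid per label."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model: Model, x: float) -> Tuple[int, float]:
    """Return (label, confidence), using a softmax over negative distances."""
    scores = {y: math.exp(-abs(x - c)) for y, c in model.items()}
    label = max(scores, key=scores.get)
    return label, scores[label] / sum(scores.values())


def self_train(source_x: Sequence[float], source_y: Sequence[int],
               target_x: Sequence[float], rounds: int = 3,
               threshold: float = 0.8) -> Model:
    """Generic self-training loop (hypothetical hyperparameters)."""
    model = fit(source_x, source_y)  # supervised model trained on the source domain
    for _ in range(rounds):
        # Pseudo-label the unlabeled target data with the current model.
        confident: List[Tuple[float, int]] = []
        for x in target_x:
            label, confidence = predict(model, x)
            if confidence >= threshold:
                confident.append((x, label))
        if not confident:
            break
        # Retrain on source labels plus confident target pseudo-labels.
        model = fit(list(source_x) + [x for x, _ in confident],
                    list(source_y) + [y for _, y in confident])
    return model


# Labeled source domain and a shifted, unlabeled target domain (synthetic data).
source_x = [0.0, 0.2, -0.2, 4.0, 4.2, 3.8]
source_y = [0, 0, 0, 1, 1, 1]
target_x = [1.0, 1.4, 1.8, 5.0, 5.4, 5.8]

baseline = fit(source_x, source_y)                   # direct deployment
adapted = self_train(source_x, source_y, target_x)   # self-training
```

In this synthetic example the target data is shifted relative to the source, so a point such as 2.2 lies on the wrong side of the baseline decision boundary, while the self-trained centroids move toward the target distribution. A real de-identification system would replace the toy classifier with a sequence tagger and would need its own criterion for selecting pseudo-labels.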
In summary, our proposed framework improves the F1-score by 5.38 on average compared with direct deployment. For example, using i2b2-2014 as the training dataset and i2b2-2006 as the test dataset, the proposed framework increases the F1-score from 76.61 to 85.41 (+8.80). The method also increases the F1-score by 10.86 for mimic-radiology and mimic-discharge.
Our work demonstrates an effective self-training framework for improving domain adaptation performance in the de-identification of electronic health records.