Effects of personal identifier resynthesis on clinical text de-identification

Authors:	Reyyan Yeniterzi, John Aberdeen, Samuel Bayer, Ben Wellner, Lynette Hirschman, Bradley Malin

Institution:	1. Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey; 2. The MITRE Corporation, Bedford, Massachusetts, USA; 3. Department of Computer Science, Brandeis University, Waltham, Massachusetts, USA; 4. Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, Tennessee, USA

Abstract:	Objective: De-identified medical records are critical to biomedical research. Text de-identification software exists, including "resynthesis" components that replace real identifiers with synthetic identifiers. The goal of this research is to evaluate the effectiveness of, and examine possible bias introduced by, resynthesis in de-identification software. Design: We evaluated the open-source MITRE Identification Scrubber Toolkit, which includes a resynthesis capability, with clinical text from Vanderbilt University Medical Center patient records. We investigated four record classes from over 500 patients' files: laboratory reports, medication orders, discharge summaries and clinical notes. We trained and tested the de-identification tool on real and resynthesized records. Measurements: We measured performance in terms of precision, recall, F-measure and accuracy for the detection of protected health identifiers as designated by the HIPAA Safe Harbor Rule. Results: The de-identification tool was trained and tested on a collection of real and resynthesized Vanderbilt records. Training and testing on the real records yielded 0.990 accuracy and 0.960 F-measure. The results improved when the tool was trained and tested on resynthesized records, with 0.998 accuracy and 0.980 F-measure, but deteriorated moderately when it was trained on real records and tested on resynthesized records, with 0.989 accuracy and 0.862 F-measure. Moreover, the results declined significantly when the tool was trained on resynthesized records and tested on real records, with 0.942 accuracy and 0.728 F-measure. Conclusion: The de-identification tool achieves high accuracy when training and test sets are homogeneous (i.e., both real or both resynthesized records). The resynthesis component regularizes the data to make them less "realistic," resulting in loss of performance, particularly when training on resynthesized data and testing on real data.

Keywords:	Privacy; natural language processing; computerized medical record systems; medical informatics
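
The abstract reports precision, recall, F-measure and accuracy together. The sketch below shows one common way to compute these four measures and why accuracy can stay high while F-measure falls. It is an illustration only, not the toolkit's scoring code: it assumes token-level binary labels (True means the token belongs to a protected identifier) and the balanced F-measure, neither of which the abstract specifies. The function name `score_tokens` and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    f_measure: float
    accuracy: float


def score_tokens(gold, predicted):
    """Score binary token labels (True = token is part of an identifier).

    Assumption: token-level scoring with the balanced F-measure (F1).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # identifier found
    fp = sum(1 for g, p in pairs if not g and p)      # false alarm
    fn = sum(1 for g, p in pairs if g and not p)      # identifier missed
    tn = sum(1 for g, p in pairs if not g and not p)  # ordinary text kept

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return Scores(precision, recall, f_measure, accuracy)


# Toy data (made up): 100 tokens, 10 of which are identifiers.
# The hypothetical system finds 6 of them and raises 1 false alarm.
gold = [True] * 10 + [False] * 90
predicted = [True] * 6 + [False] * 4 + [True] * 1 + [False] * 89

example = score_tokens(gold, predicted)
print(f"precision={example.precision:.3f} recall={example.recall:.3f}")
print(f"f_measure={example.f_measure:.3f} accuracy={example.accuracy:.3f}")
```

In this toy case accuracy is 0.95 even though 4 of the 10 identifiers are missed, because most tokens are ordinary text that is trivially labeled correctly. F-measure ignores those true negatives and drops to about 0.71, which is consistent with the pattern in the reported results, where accuracy varies far less than F-measure across the four train/test conditions.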
