

De-identification of clinical narratives through writing complexity measures
Institution: 1. Department of Electrical Engineering & Computer Science, Vanderbilt University, Nashville, TN, United States; 2. Group Health Research Institute, Seattle, WA, United States; 3. The MITRE Corporation, Bedford, MA, United States; 4. Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, United States
Abstract:
Purpose: Electronic health records contain a substantial quantity of clinical narrative, which is increasingly reused for research purposes. To share data on a large scale while respecting privacy, it is critical to remove patient identifiers. De-identification tools based on machine learning have been proposed; however, model training is usually based on either a random group of documents or a pre-existing document type designation (e.g., discharge summary). This work investigates whether inherent features, such as writing complexity, can identify document subsets that enhance de-identification performance.
Methods: We applied an unsupervised clustering method to group two corpora based on writing complexity measures: a collection of over 4500 documents of varying document types (e.g., discharge summaries, history and physical reports, and radiology reports) from Vanderbilt University Medical Center (VUMC) and the publicly available i2b2 corpus of 889 discharge summaries. We compared the performance (via recall, precision, and F-measure) of de-identification models trained on such clusters with that of models trained on documents grouped randomly or by VUMC document type.
Results: For the Vanderbilt dataset, training and testing de-identification models on the same stylometric cluster (average F-measure of 0.917) tended to outperform models based on clusters of random documents (average F-measure of 0.881). Increasing the size of a training subset sampled from a specific cluster could also yield improved results: for subsets from one stylometric cluster, the F-measure rose from 0.743 to 0.841 when the training size increased from 10 to 50 documents, and reached 0.901 when the training subset reached 200 documents. For the i2b2 dataset, training and testing on the same complexity-based clusters (average F-measure of 0.966) did not significantly surpass randomly selected clusters (average F-measure of 0.965).
Conclusions: Our findings illustrate that, in environments consisting of a variety of clinical documentation, de-identification models trained on document clusters derived from writing complexity measures tend to outperform models trained on random groups and, in many instances, models trained on document types.
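The grouping step described in the Methods can be illustrated with a minimal sketch. The abstract does not list the specific writing complexity measures or name the clustering algorithm, so everything here is an assumption chosen for illustration: the three features (average sentence length, average word length, type-token ratio), the plain k-means routine, the function names `complexity_features` and `kmeans`, and the toy notes.

```python
import re
from statistics import mean

def complexity_features(text):
    """Illustrative measures (assumed, not the paper's): average sentence
    length in words, average word length, and type-token ratio."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not sentences or not words:
        return (0.0, 0.0, 0.0)
    return (
        len(words) / len(sentences),
        mean(len(w) for w in words),
        len(set(words)) / len(words),
    )

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(mean(dim) for dim in zip(*members))
    return labels

# Toy notes (invented): two terse notes and two long narrative sentences.
docs = [
    "Pt stable. No pain. Afebrile. Plan d/c.",
    "BP ok. Lungs clear. No edema. F/u 2 wk.",
    "The patient is a 67-year-old man who presented with progressive "
    "shortness of breath over several weeks and was admitted for further "
    "evaluation of suspected heart failure.",
    "She was subsequently transferred to the medical floor where her "
    "antibiotics were continued and her respiratory status gradually "
    "improved over the following five days.",
]
features = [complexity_features(d) for d in docs]
labels = kmeans(features, k=2)
print(labels)
```

In this sketch the terse notes and the long narrative notes fall into separate clusters, and a separate de-identification model could then be trained per cluster. The features are left unscaled here, so sentence length dominates the distance; a real pipeline would likely standardize each feature first.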
Keywords: Electronic medical records; Privacy; Natural language processing
This article is indexed in ScienceDirect and other databases.