Text mining of cancer-related information: Review of current status and future directions |
| |
Affiliation: | 1. School of Computer Science & Informatics, Cardiff University, Cardiff CF24 3AA, UK; 2. Clinical Outcomes Unit, The Christie NHS Foundation Trust, Manchester M20 4BX, UK; 3. School of Computer Science, The University of Manchester, Manchester M13 9PL, UK; 4. Health e-Research Centre, Manchester M13 9PL, UK; 5. Manchester Institute of Biotechnology, Manchester M1 7DN, UK |
| |
Abstract: | Purpose: This paper reviews the research literature on text mining (TM) with the aim of finding out (1) which cancer domains have been the subject of TM efforts, (2) which knowledge resources can support TM of cancer-related information and (3) to what extent systems that rely on knowledge and computational methods can convert text data into useful clinical information. These questions were used to determine the current state of the art in this particular strand of TM and to suggest future directions in TM development to support cancer research. Methods: A review of the research on TM of cancer-related information was carried out. A literature search was conducted on the Medline database as well as the IEEE Xplore and ACM digital libraries to address the interdisciplinary nature of such research. The search results were supplemented with the literature identified through Google Scholar. Results: A range of studies have proven the feasibility of TM for extracting structured information from clinical narratives such as those found in pathology or radiology reports. In this article, we provide a critical overview of the current state of the art for TM related to cancer. The review highlighted a strong bias towards symbolic methods, e.g. named entity recognition (NER) based on dictionary lookup and information extraction (IE) relying on pattern matching. The F-measure of NER ranges between 80% and 90%, while that of IE for simple tasks is in the high 90s. To further improve the performance, TM approaches need to deal effectively with idiosyncrasies of the clinical sublanguage such as non-standard abbreviations as well as a high degree of spelling and grammatical errors. This requires a shift from rule-based methods to machine learning following the success of similar trends in biological applications of TM.
Machine learning approaches require large training datasets, but clinical narratives are not readily available for TM research due to privacy and confidentiality concerns. This issue remains the main bottleneck for progress in this area. In addition, there is a need for a comprehensive cancer ontology that would enable semantic representation of textual information found in narrative reports. |
| |
Keywords: | Cancer; Natural language processing; Data mining; Electronic medical records |
This article is indexed in ScienceDirect and other databases. |
|