Similar Documents
19 similar documents found (search time: 15 ms)
1.
There is interest in expanding the reach of literature mining to include the analysis of biomedical images, which often contain a paper's key findings. Examples include recent studies that use Optical Character Recognition (OCR) to extract image text, which is then used to boost biomedical image retrieval and classification. Such studies rely on the robust identification of text elements in biomedical images, which is a non-trivial task. In this work, we introduce a new text detection algorithm for biomedical images based on iterative projection histograms. We study the effectiveness of our algorithm by evaluating its performance on a set of manually labeled, randomly selected biomedical images and by comparing it against other state-of-the-art text detection algorithms. We demonstrate that our projection histogram-based text detection approach is well suited to biomedical images and that iterative application of the algorithm boosts performance to an F score of 0.60. A C++ implementation of our algorithm is freely available for academic use.
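A minimal sketch of the projection-histogram idea on a binarized image (the thresholds, function names and fixed iteration depth are illustrative assumptions, not the paper's implementation):

```python
# Sketch of projection-histogram text detection on a binary image
# (text pixels = 1, background = 0); thresholds are illustrative.
import numpy as np

def split_bands(profile, min_fill=1, min_gap=2):
    """Return (start, end) index pairs of runs where profile >= min_fill,
    ignoring gaps shorter than min_gap."""
    bands, start, gap = [], None, 0
    for i, v in enumerate(profile):
        if v >= min_fill:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                bands.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        bands.append((start, len(profile)))
    return bands

def detect_text_boxes(binary, depth=3):
    """Iteratively split the image into candidate text boxes by alternating
    horizontal and vertical projection histograms."""
    boxes = [(0, binary.shape[0], 0, binary.shape[1])]
    for level in range(depth):
        axis = level % 2          # 0: split on rows, 1: split on columns
        refined = []
        for (r0, r1, c0, c1) in boxes:
            region = binary[r0:r1, c0:c1]
            profile = region.sum(axis=1 - axis)   # row sums or column sums
            for (a, b) in split_bands(profile):
                if axis == 0:
                    refined.append((r0 + a, r0 + b, c0, c1))
                else:
                    refined.append((r0, r1, c0 + a, c0 + b))
        boxes = refined
    return boxes

img = np.zeros((40, 60), dtype=int)
img[5:12, 10:50] = 1      # a synthetic "text line"
img[20:27, 10:25] = 1
print(detect_text_boxes(img))
```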

2.
Objectives: Extracting data from publication reports is a standard process in systematic review (SR) development. However, the data extraction process still relies heavily on manual effort, which is slow, costly, and subject to human error. In this study, we developed a text summarization system aimed at enhancing productivity and reducing errors in the traditional data extraction process. Methods: We developed a computer system that used machine learning and natural language processing approaches to automatically generate summaries of full-text scientific publications. The summaries were evaluated at the sentence and fragment levels for finding common clinical SR data elements such as sample size, group size, and PICO values. We compared the computer-generated summaries with human-written summaries (title and abstract) in terms of the presence of the information needed for data extraction, as presented in the Cochrane review study characteristics tables. Results: At the sentence level, the computer-generated summaries covered more of the information needed for systematic reviews than the human-written summaries did (recall 91.2% vs. 83.8%, p < 0.001). They also had a higher density of relevant sentences (precision 59% vs. 39%, p < 0.001). At the fragment level, an ensemble approach combining rule-based, concept mapping, and dictionary-based methods performed better than the individual methods alone, achieving an 84.7% F-measure. Conclusion: Computer-generated summaries are a potential alternative information source for data extraction in systematic review development. Machine learning and natural language processing are promising approaches to the development of such an extractive summarization system.
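A toy sketch of the kind of rule-based fragment extraction such a system might combine with its other components (the regular expressions and element names are illustrative assumptions, not the study's rules):

```python
# Illustrative rule-based extraction of sample-size and group-size fragments.
import re

SAMPLE_SIZE = re.compile(
    r"\b(\d+)\s+(?:patients|participants|subjects)\s+were\s+(?:enrolled|randomi[sz]ed)", re.I)
GROUP_SIZE = re.compile(r"\(\s*n\s*=\s*(\d+)\s*\)", re.I)

def extract_fragments(sentences):
    """Return (element, value, sentence) triples found by the patterns."""
    found = []
    for s in sentences:
        for label, pattern in (("sample_size", SAMPLE_SIZE), ("group_size", GROUP_SIZE)):
            for match in pattern.finditer(s):
                found.append((label, match.group(1), s))
    return found

text = [
    "A total of 120 patients were randomized to treatment or placebo.",
    "The intervention group (n = 61) received the drug for 12 weeks.",
]
for label, value, sentence in extract_fragments(text):
    print(label, value)
```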

3.
Background and significance: Sparsity is often a desirable property of statistical models, and various feature selection methods exist to yield sparser, more interpretable models. However, their application to biomedical text classification, particularly to mortality risk stratification among intensive care unit (ICU) patients, has not been thoroughly studied. Objective: To develop and characterize sparse classifiers based on the free text of nursing notes in order to predict ICU mortality risk and to discover the text features most strongly associated with mortality. Methods: We selected nursing notes from the first 24 h of ICU admission for 25,826 adult ICU patients from the MIMIC-II database. We then developed a pair of stochastic gradient descent-based classifiers with elastic-net regularization. We also studied the performance-sparsity tradeoffs of both classifiers as their regularization parameters were varied. Results: The best-performing classifier achieved a 10-fold cross-validated AUC of 0.897 under the log loss function and full L2 regularization, while full L1 regularization used just 0.00025% of the candidate input features and resulted in an AUC of 0.889. The log loss (range of AUCs 0.889–0.897) yielded better performance than the hinge loss (0.850–0.876), but the latter yielded even sparser models. Discussion: Most features selected by both classifiers appear clinically relevant and correspond to predictors already present in existing ICU mortality models. The sparser classifiers were also able to discover a number of informative, albeit nonclinical, features. Conclusion: The elastic-net-regularized classifiers perform reasonably well and can reduce the number of features required by over a thousandfold, with only a modest impact on performance.
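A minimal sketch of an elastic-net-regularized bag-of-words classifier of this kind, using scikit-learn; the placeholder notes, labels and parameter values are illustrative assumptions, not the study's data or settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus: replace with real nursing notes and mortality labels.
notes = ["sedated on ventilator, pressors titrated, low urine output"] * 50 \
      + ["alert and oriented, comfortable, tolerating diet"] * 50
died = [1] * 50 + [0] * 50

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    SGDClassifier(
        loss="log_loss",      # logistic loss ("log" in older scikit-learn); "hinge" gives sparser SVM-like models
        penalty="elasticnet",
        alpha=1e-4,           # overall regularization strength
        l1_ratio=0.5,         # 0 = pure L2, 1 = pure L1; sweep this for the performance-sparsity tradeoff
        max_iter=1000,
    ),
)
aucs = cross_val_score(model, notes, died, cv=10, scoring="roc_auc")
print(f"10-fold AUC: {aucs.mean():.3f}")
```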

4.
OBJECTIVE: Medical data are often very high dimensional. Depending upon the use, some data dimensions might be more relevant than others. In processing medical data, choosing the optimal subset of features is important not only to reduce the processing cost but also to improve the usefulness of the model built from the selected data. This paper presents a data mining study of medical data with fuzzy modeling methods that use feature subsets selected by several indices/methods. METHODS: Specifically, three fuzzy modeling methods are employed: the fuzzy k-nearest neighbor algorithm, a fuzzy clustering-based modeling, and the adaptive network-based fuzzy inference system. For feature selection, a total of 11 indices/methods are used. The medical data mined include the Wisconsin breast cancer dataset and the Pima Indians diabetes dataset. The classification accuracy and computational time are reported. To show how good the best performer is, the globally optimal feature subset was also found by exhaustively testing all possible combinations of three features. RESULTS: For the Wisconsin breast cancer dataset, the best accuracy obtained was 97.17%, only 0.25% lower than that obtained by exhaustive testing. For the Pima Indians diabetes dataset, the best accuracy obtained was 77.65%, only 0.13% lower than that obtained by exhaustive testing. CONCLUSION: This paper shows that feature selection is important in mining medical data, both to reduce processing time and to increase classification accuracy. However, not all combinations of feature selection and modeling methods are equally effective, and the best combination is often data-dependent, as supported by the breast cancer and diabetes data analyzed in this paper.
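A sketch of the exhaustive three-feature search described above, using the Wisconsin diagnostic breast cancer data bundled with scikit-learn and a plain k-NN classifier standing in for the fuzzy modeling methods; note this bundled dataset has 30 features rather than the 9 of the original Wisconsin set, so the loop takes a few minutes (all settings are illustrative):

```python
# Exhaustive evaluation of all 3-feature subsets with 5-fold cross-validation.
from itertools import combinations
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

best_acc, best_subset = 0.0, None
for subset in combinations(range(X.shape[1]), 3):   # every 3-feature subset
    acc = cross_val_score(knn, X[:, subset], y, cv=5).mean()
    if acc > best_acc:
        best_acc, best_subset = acc, subset

print(best_subset, round(best_acc, 4))
```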

5.
A number of techniques such as information extraction, document classification, document clustering and information visualization have been developed to ease the extraction and understanding of information embedded within text documents. However, knowledge embedded in natural language text is difficult to extract using simple pattern matching techniques, and most of these methods do not directly help users understand key concepts and their semantic relationships in a document corpus, which are critical for capturing its conceptual structure. The problem arises because most of the information is embedded within unstructured or semi-structured text that computers cannot easily interpret. In this paper, we present a novel Biomedical Knowledge Extraction and Visualization framework, BioKEVis, to identify key information components from biomedical text documents. The information components are centered on key concepts. BioKEVis applies linguistic analysis and Latent Semantic Analysis (LSA) to identify key concepts. The information component extraction principle is based on natural language processing techniques and semantic-based analysis. The system is also integrated with a biomedical named entity recognizer, ABNER, to tag genes, proteins and other entity names in the text. We also present a method for collating information extracted from multiple sources to generate a semantic network. The network provides distinct user perspectives, allows navigation over documents with similar information components, and provides a comprehensive view of the collection. The system stores the extracted information components in a structured repository, which is integrated with a query-processing module to handle biomedical queries over text documents. We also propose a document ranking mechanism to present retrieved documents in order of their relevance to the user query.
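A minimal sketch of the latent semantic analysis step that can surface key concepts: TF-IDF followed by truncated SVD (the corpus and the number of latent dimensions are illustrative assumptions, not BioKEVis itself):

```python
# LSA over a tiny toy corpus: TF-IDF vectors reduced with truncated SVD.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "BRCA1 protein interacts with RAD51 during DNA repair",
    "TP53 mutations are frequent in many tumour types",
    "RAD51 foci formation requires BRCA2",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

lsa = TruncatedSVD(n_components=2, random_state=0)
lsa.fit(X)

terms = tfidf.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top = component.argsort()[::-1][:5]          # highest-weighted terms per latent dimension
    print(f"concept {i}:", [terms[j] for j in top])
```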

6.
A multitude of information sources is present in the electronic health record (EHR), each of which can contain clues for automatically assigning diagnosis and procedure codes. These sources, however, show information overlap and quality differences, which complicates the retrieval of these clues. Through feature selection, a denser representation with a consistent quality and less information overlap can be obtained. We introduce and compare coverage-based feature selection methods based on confidence and information gain. These approaches were evaluated over a range of medical specialties: seven medical specialties for ICD-9-CM code prediction (six at the Antwerp University Hospital and one in the MIMIC-III dataset) and two medical specialties for ICD-10-CM code prediction. Using confidence coverage to integrate all sources in an EHR shows a consistent improvement in F-measure (49.83% for diagnosis codes on average), both compared with the baseline (44.25% for diagnosis codes on average) and with using the best standalone source (44.41% for diagnosis codes on average). Confidence coverage creates a concise patient stay representation independent of a rigid framework such as UMLS, and yields easily interpretable features. Confidence coverage has several advantages over the baseline setup, in which feature selection was limited to a filter removing features with fewer than five total occurrences in the training set. Prediction results improved consistently when multiple heterogeneous sources were used to predict clinical codes, while the number of features and the processing time were reduced.

7.
In diagnosing diseases in clinical practice, a combination of three clinical findings is often used to represent each disease. This is largely because it is often difficult or impractical to assess all possible combinations of symptoms and abnormal exam findings that occur in any particular disease. For most diseases, diagnostic triads are based on empirical observations. In this study, we determined diagnostic triads for chronic diseases using data mining procedures. We also verified the validity of these combinations as well as our procedure for determining them. We used symptoms and examination findings from 477 patients with chronic diseases, collected as part of a 35-year longitudinal study begun in 1968. For each patient there were 295 items from examinations in internal medicine, dermatology, ophthalmology, dentistry and blood tests. We judged each item to be either normal or abnormal, and restricted the analysis to the abnormal findings. To analyze such an exhaustive assortment, we used the data mining technique of association analysis. The analysis generated three clinical findings for each disease. Diseases were defined based on blood tests. Searching through all 295 items to find the three most useful clinical findings would be impractical on a commodity PC; however, by excluding normal items we were able to reduce the total number of combinations enough to make combinatorial analysis on a PC feasible. In addition to supporting more accurate diagnoses, we believe our technique can identify the diagnostic data that are most cost effective in terms of the time and other resources required for their collection.
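A toy sketch of mining three-finding combinations with a simple support/confidence scan, in the spirit of the association analysis described above (the records, finding codes and thresholds are made up for illustration):

```python
# Enumerate all triads of abnormal findings and rank them by confidence for a disease.
from itertools import combinations

# Each record: (set of abnormal finding codes, diagnosed disease)
records = [
    ({"F12", "F87", "F203", "F4"}, "diabetes"),
    ({"F12", "F87", "F203"}, "diabetes"),
    ({"F87", "F150"}, "none"),
    ({"F12", "F87", "F203", "F9"}, "diabetes"),
]

def best_triads(records, disease, min_support=2):
    counts, hits = {}, {}
    for findings, dx in records:
        for triad in combinations(sorted(findings), 3):
            counts[triad] = counts.get(triad, 0) + 1
            if dx == disease:
                hits[triad] = hits.get(triad, 0) + 1
    scored = [
        (triad, hits.get(triad, 0) / counts[triad], counts[triad])
        for triad in counts if counts[triad] >= min_support
    ]
    # Rank by confidence, then by support.
    return sorted(scored, key=lambda t: (t[1], t[2]), reverse=True)

for triad, confidence, support in best_triads(records, "diabetes")[:3]:
    print(triad, f"confidence={confidence:.2f}", f"support={support}")
```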

8.
Gene expression profile classification is a pivotal research domain assisting in the transformation from traditional to personalized medicine. A major challenge associated with gene expression data classification is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms to reduce the number of genes. Recent studies have experimented with using semantic similarity between genes in the Gene Ontology (GO) to improve feature selection. While a few studies discuss how to use GO for feature selection, there is no simulation study that addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation that generates binary class datasets in which the differentially expressed genes between the two classes have some underlying relationship in GO. This allows us to investigate the effects of factors such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes (denoted by δ), and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor determining the efficacy of GO-based feature selection. In particular, as the connectedness of differentially expressed genes increases, the improvement in classification accuracy increases. To quantify this notion of connectedness, we defined a measure called the Biological Condition Annotation Level, BCAL(G), where G is a graph of differentially expressed genes. Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) as the number of genes in a biological condition increases beyond 50 and δ ≤ 0.7, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), cases where δ is between 0.3 and 2.5, and training sample sizes between 20 and 200, so our conclusions are limited to these specifications. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistical ranking measures.

9.
OBJECTIVE: The general concept behind fMRI data analysis for decision support is leveraging previously hidden knowledge from publicly available metadata sources with a high degree of precision. METHODS AND MATERIALS: Normalized fMRI scans are used to calculate cumulative voxel intensity curves for every subject in the dataset that fits the chosen demographic criteria. The voxel intensity curve has a direct linear relationship to the subject's neuronal activity. In the case of head trauma, a subject's voxel intensity curve would be statistically compared to the weighted average curve of every demographically similar subject in the dataset. If the new subject's neuronal activity falls below the threshold for their demographic group, the brain injury detection (BID) system would then pinpoint the areas of deficiency based on Brodmann's cortical areas. ANALYSIS: The analysis presented in this paper indicates that statistical differences among demographic groups exist in BOLD fMRI responses. CONCLUSION: Useful knowledge can in fact be leveraged by mining stockpiled fMRI data without the need for unique human identifiers. The BID system offers the radiologist statistically based decision support for brain injury.

10.
Objectives: It has become regular practice to de-identify unstructured medical text for use in research using automatic methods, the goal of which is to remove patient-identifying information and minimize re-identification risk. The metrics commonly used to determine whether these systems are performing well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases. Methods: We applied the proposed evaluation framework to a data set from the University of Michigan Medical School. Our risk assessment results were then compared with those that would be obtained using a typical contemporary micro-averaged evaluation of recall, in order to illustrate the difference between the proposed evaluation framework and the current baseline method. Results: We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. Using our evaluation framework, we obtained a mean probability of re-identification of 0.0074 for direct identifiers and 0.0022 for quasi-identifiers. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on previously used approaches in the literature, and the threshold for quasi-identifiers was determined from the context of the data release, following commonly used de-identification criteria for structured data. Discussion: Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions made by the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification. Conclusions: This framework should be used as a basis for computing re-identification risk in order to evaluate future text de-identification tools more realistically.

11.
Objectives: Existing approaches to deriving decision models from plaintext clinical data frequently depend on medical dictionaries as the source of potential features. Prior research suggests that decision models developed using non-dictionary-based feature sourcing approaches and “off the shelf” tools can predict cancer with performance metrics between 80% and 90%. We sought to compare non-dictionary-based models to models built using features derived from medical dictionaries. Materials and methods: We evaluated the detection of cancer cases from free-text pathology reports using decision models built with combinations of dictionary- or non-dictionary-based feature sourcing approaches, 4 feature subset sizes, and 5 classification algorithms. Each decision model was evaluated using the following performance metrics: sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristic (ROC) curve. Results: Decision models parameterized using dictionary and non-dictionary feature sourcing approaches produced performance metrics between 70% and 90%. The source of features and the feature subset size had no impact on the performance of a decision model. Conclusion: Our study suggests there is little value in leveraging medical dictionaries to extract features for decision model building. Decision models built using features extracted from the plaintext reports themselves achieve results comparable to those built using medical dictionaries. Overall, this suggests that existing “off the shelf” approaches can be leveraged to perform accurate cancer detection using less complex Named Entity Recognition (NER)-based feature extraction, automated feature selection and modeling approaches.
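A minimal sketch of a non-dictionary, “off the shelf” pipeline of the kind described: token n-grams taken from the reports themselves, chi-squared feature selection, and a standard classifier (the reports, labels and subset size are illustrative placeholders):

```python
# Bag-of-words cancer-case detection from free-text pathology reports.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "invasive ductal carcinoma identified in the left breast specimen",
    "benign fibroadipose tissue with no evidence of malignancy",
] * 20                                  # placeholder corpus
labels = [1, 0] * 20                    # 1 = cancer case

pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),      # features come from the text itself
    SelectKBest(chi2, k=20),                  # feature subset size to vary
    LogisticRegression(max_iter=1000),
)
pipeline.fit(reports, labels)
print(pipeline.predict(["invasive carcinoma identified in the specimen margins"]))
```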

12.
Text-based search is widely used for biomedical data mining and knowledge discovery. Character errors in the literature affect the accuracy of data mining, and methods for solving this problem are being explored. This work tests the usefulness of the Smith-Waterman algorithm with an affine gap penalty as a method for biomedical literature retrieval. Names of medicinal herbs collected from the herbal medicine literature are matched against those from the medicinal chemistry literature using this algorithm at different string identity levels (80-100%). The optimum performance is at a string identity of 88%, at which the recall and precision are 96.9% and 97.3%, respectively. Our study suggests that the Smith-Waterman algorithm is useful for improving the success rate of biomedical text retrieval.
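A sketch of Smith-Waterman local alignment with an affine gap penalty (Gotoh's formulation) applied to herb-name matching; the scoring values and the identity normalization are illustrative assumptions, not those used in the study:

```python
# Smith-Waterman local alignment score with affine gap penalties.
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Return the best local alignment score between strings a and b."""
    n, m = len(a), len(b)
    NEG = float("-inf")
    # H: best score ending at (i, j); E/F: best score ending in a gap in a / b.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

def identity(a, b):
    """Normalize the alignment score to a rough percent-identity measure."""
    return 100.0 * smith_waterman_affine(a, b) / (2 * max(len(a), len(b), 1))

print(identity("Glycyrrhiza uralensis", "Glycyrrhiza uralenis"))  # small typo still matches well
```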

13.

Objectives

Despite medical advances, infectious diseases are still a major cause of mortality, morbidity, disability and socio-economic upheaval worldwide. Early diagnosis and the appropriate choice and immediate initiation of antibiotic therapy can greatly affect the outcome of any kind of infection. Phagocytes play a central role in the innate immune response of the organism to infection. They comprise the first line of defense against infectious intruders in our body and are able to produce large quantities of reactive oxygen species, which can be detected by means of chemiluminescence (CL). The data preparation approach implemented in this work corresponds to a dynamic assessment of phagocytic respiratory burst localization in a luminol-enhanced whole blood CL system. We have previously applied this approach to the problem of identifying various intra-abdominal pathological processes afflicting peritoneal dialysis patients in the Nephrology department and demonstrated 84.6% predictive accuracy with the C4.5 decision-tree algorithm. In this study, we apply the CL-based approach to a larger sample of patients from two departments (Nephrology and Internal Medicine) with the aim of finding the most effective and interpretable feature sets and classification models for fast and accurate identification of several infectious diseases.

Materials and methods

Whole blood samples were collected from 78 patients (comprising 115 instances) with respiratory infections, infections associated with renal replacement therapy, or no infections. CL kinetic parameters were calculated for each case, and each case was assigned to a specific clinical group according to the available clinical diagnostics. Wrapper and filter feature selection methods were applied to remove irrelevant and redundant features and to improve the predictive performance of the disease classification algorithms. Three data mining algorithms, the C4.5 (J48) decision tree, support vector machines and the naive Bayes classifier, were applied to induce disease classification models, and their performance in classifying the three clinical groups was evaluated by 10 runs of stratified 10-fold cross-validation.
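A minimal sketch of this kind of evaluation setup: forward wrapper feature selection feeding a decision tree (scikit-learn's CART standing in for C4.5/J48), scored by 10 runs of stratified 10-fold cross-validation; the data are synthetic placeholders for the CL kinetic parameters:

```python
# Wrapper forward feature selection + decision tree, repeated stratified 10-fold CV.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(115, 12))          # 115 instances, 12 placeholder CL kinetic parameters
y = rng.integers(0, 3, size=115)        # 3 clinical groups

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
selector = SequentialFeatureSelector(tree, n_features_to_select=5, direction="forward", cv=5)
model = make_pipeline(selector, tree)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```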

Results and conclusions

The results demonstrate that the predictive power of the best models obtained with the three evaluated algorithms after feature selection was in the range of 63.38 ± 2.18% to 70.68 ± 1.43%. The highest disease classification accuracy was reached by C4.5, which also provides the most informative model in the form of a decision tree; the lowest accuracy was obtained with naive Bayes. The feature selection method attaining the best classification performance was the wrapper method in the forward direction. Moreover, the classification models exposed biological patterns specific to the clinical states, and the predictive features selected were found to be characteristic of a specific disorder. Based on these encouraging results, we believe that the CL-based data pre-processing approach, combined with the wrapper forward feature selection procedure and the C4.5 decision-tree algorithm, has clear potential to become a fast, informative, and sensitive tool for the predictive diagnostics of infectious diseases in clinics.

14.
The term pap smear refers to samples of human cells stained by the so-called Papanicolaou method. The purpose of the Papanicolaou method is to diagnose pre-cancerous cell changes before they progress to invasive carcinoma. In this paper a metaheuristic algorithm is proposed for classifying the cells. Two databases are used, constructed at different times by expert MDs and consisting of 917 and 500 images of pap smear cells, respectively. Each cell is described by 20 numerical features, and the cells fall into 7 classes, although a minimal requirement is to separate normal from abnormal cells, which is a 2-class problem. To find the best-performing feature subset, an effective genetic algorithm scheme is proposed. This algorithmic scheme is combined with a number of nearest neighbor based classifiers. Results show that the classification accuracy generally outperforms that of other previously applied intelligent approaches.
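A toy sketch of a genetic algorithm for feature subset selection wrapped around a nearest neighbor classifier, in the spirit of the scheme above; the data, population size and GA settings are illustrative placeholders, not the paper's configuration:

```python
# GA feature subset selection with a k-NN wrapper fitness.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))           # placeholder: 200 cells, 20 numeric features
y = rng.integers(0, 2, size=200)         # normal vs abnormal

def fitness(mask):
    """Cross-validated k-NN accuracy on the selected feature columns."""
    if not mask.any():
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(knn, X[:, mask], y, cv=3).mean()

pop = rng.random((20, X.shape[1])) < 0.5                       # 20 random feature subsets
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    order = np.argsort(scores)[::-1]
    parents = pop[order[:10]]                                  # keep the best half
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, X.shape[1])                      # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05                   # mutation
        children.append(child ^ flip)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```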

15.
The identification of a set of relevant but not redundant features is an important first step in building predictive and diagnostic models from biomedical data sets. Most commonly, individual features are ranked in terms of a quality criterion, and the best (first) k features are selected. However, feature ranking methods do not sufficiently account for interactions and correlations between features, so redundancy is likely to be encountered among the selected features. We present a new algorithm, termed Redundancy Demoting (RD), that takes an arbitrary feature ranking as input and improves this ranking by identifying redundant features and demoting them to positions in the ranking at which they are not redundant. Redundant features are those that are correlated with other features and are not relevant in the sense that they do not improve the discriminatory ability of a set of features. Experiments on two cancer data sets, a melanoma image data set and a lung cancer microarray data set, show that our algorithm greatly improves the feature rankings provided by information gain, ReliefF and Student's t-test in terms of predictive power.
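A simplified sketch of the redundancy-demoting idea: walk an existing ranking and push features that are highly correlated with an already accepted feature to the back (the correlation threshold is illustrative, and the published RD algorithm also checks relevance before demoting):

```python
# Demote features correlated with a better-ranked feature to the end of the ranking.
import numpy as np

def demote_redundant(X, ranking, corr_threshold=0.9):
    """X: samples x features; ranking: feature indices, best first."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept, demoted = [], []
    for f in ranking:
        if any(corr[f, g] >= corr_threshold for g in kept):
            demoted.append(f)          # redundant with a better-ranked feature
        else:
            kept.append(f)
    return kept + demoted              # demoted features move to the end

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base, base + 0.01 * rng.normal(size=(100, 1)), rng.normal(size=(100, 3))])
print(demote_redundant(X, ranking=[0, 1, 2, 3, 4]))   # feature 1 gets demoted
```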

16.
17.
This paper analyses the performance of four different feature-selection approaches of the Karhunen-Loève expansion (KLE) method to select the most discriminant set of features for computer-assisted classification of bioprosthetic heart-valve status. First, an evaluation test reducing the number of initial features while maintaining the performance of the original classifier is developed. Secondly, the effectiveness of the classification in a simulated practical situation, where a new sample has to be classified, is estimated with a validation test. Results from both tests applied to a reference database show that the most efficient feature selection and classification (≥ 97% of correct classifications (CCs)) are performed by the Kittler and Young approach. For the clinical databases, this approach provides poor classification results for simulated 'new samples' (between 50 and 69% of CCs). For both the evaluation and the validation tests, only the Heydorn and Tou approach provides classification results comparable with those of the original classifier (a difference always ≤ 7%). However, the degree of feature reduction is particularly variable. The study demonstrates that the KLE feature-selection approaches are highly population-dependent. It also shows that the proposed validation method is advantageous in clinical applications where data collection is difficult to perform.

18.
Objective: In the field of computer-aided detection (CAD) systems for lung nodules in computed tomography (CT) scans, many image features have been proposed and many artificial neural network (ANN) classifiers with various structural topologies have been analyzed; frequently, the classifier topologies are selected by trial-and-error experiments. To avoid such trial-and-error approaches, we present a novel classifier that evolves ANNs using genetic algorithms, called "Phased Searching with NEAT in a Time or Generation-Scaled Framework", integrating feature selection with the classification task. Methods and materials: We analyzed our method's performance on 360 CT scans from the public Lung Image Database Consortium database. We compare our method's performance with that of other, more established classifiers, namely regular NEAT, Feature-Deselective NEAT (FD-NEAT), fixed-topology ANNs, and support vector machines (SVMs), using ten-fold cross-validation experiments on all 360 scans. Results: The results show that the proposed "Phased Searching" method performs better and faster than regular NEAT, better than FD-NEAT, and achieves sensitivities at 3 and 4 false positives (FP) per scan that are comparable with the fixed-topology ANN and SVM classifiers, but with fewer input features. It achieves a detection sensitivity of 83.0 ± 9.7% at an average of 4 FP/scan for nodules with a diameter greater than or equal to 3 mm. It also evolves networks with shorter evolution times and lower complexities than regular NEAT (p = 0.026 and p < 0.001, respectively). Analysis of the average and best network complexities evolved by regular NEAT and by our approach shows that our approach searches for good solutions in lower-dimensional search spaces and evolves networks without superfluous structure. Conclusions: We have presented a novel approach that combines feature selection with the evolution of ANN topology and weights. Compared with the original threshold-based Phased Searching method of Green, our method requires fewer parameters and converges to the optimal network complexity required for the classification task at hand. The results of the ten-fold cross-validation experiments also show that our proposed CAD system for lung nodule detection performs well with respect to other methods in the literature.

19.