Similar Articles
20 similar articles retrieved (search time: 577 ms)
1.
For two-class problems, we introduce and construct mappings of high-dimensional instances into dissimilarity (distance)-based Class-Proximity Planes. The Class Proximity Projections extend our earlier relative distance plane mapping and thus provide a more general, unified approach to the simultaneous classification and visualization of many-feature datasets. The mappings display all L-dimensional instances in two-dimensional coordinate systems whose two axes represent the distances of the instances to various pre-defined proximity measures of the two classes. The Class Proximity mappings provide a variety of perspectives on the dataset to be classified and visualized. We report and compare the classification and visualization results obtained with various Class Proximity Projections and their combinations on four datasets from the UCI database, as well as on a particular high-dimensional biomedical dataset.
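A minimal sketch of the mapping idea (not the authors' code): each instance is plotted by its distances to a proximity measure of each class. The class centroid is assumed here as one concrete proximity measure; the paper considers several.

```python
import numpy as np

def class_proximity_plane(X, y):
    """Map L-dimensional instances into a 2-D Class-Proximity Plane.

    Axis 1: distance of each instance to the class-0 proximity measure;
    axis 2: distance to the class-1 measure. The class centroid is used
    here as one concrete choice; the paper explores several measures.
    """
    p0 = X[y == 0].mean(axis=0)          # class-0 proximity (centroid)
    p1 = X[y == 1].mean(axis=0)          # class-1 proximity (centroid)
    d0 = np.linalg.norm(X - p0, axis=1)
    d1 = np.linalg.norm(X - p1, axis=1)
    return np.column_stack([d0, d1])     # 2-D coordinates, ready to plot
```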

2.
Medical applications are often characterized by a large number of disease markers and a relatively small number of data records. We demonstrate that complete feature ranking followed by selection can lead to appreciable reductions in data dimensionality, with significant improvements in the implementation and performance of classifiers for medical diagnosis. We describe a novel approach for ranking all features according to their predictive quality using properties unique to learning algorithms based on the group method of data handling (GMDH). An abductive network training algorithm is repeatedly used to select groups of optimum predictors from the feature set at gradually increasing levels of model complexity specified by the user. Groups selected earlier are better predictors. The process is then repeated to rank features within individual groups. The resulting full feature ranking can be used to determine the optimum feature subset by starting at the top of the list and progressively including more features until the classification error rate on an out-of-sample evaluation set starts to increase due to overfitting. The approach is demonstrated on two medical diagnosis datasets (breast cancer and heart disease) and comparisons are made with other feature ranking and selection methods. Receiver operating characteristic (ROC) analysis is used to compare classifier performance. At default model complexity, dimensionality reduction of 22% and 54% could be achieved for the breast cancer and heart disease data, respectively, leading to improvements in the overall classification performance. For both datasets, considerable dimensionality reduction introduced no significant reduction in the area under the ROC curve. GMDH-based feature selection results have also proved effective with neural network classifiers.
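The GMDH/abductive-network ranking itself is algorithm-specific, but the subset-selection loop the abstract describes (walk down the ranked list, stop when the out-of-sample error starts to rise) can be sketched generically. A mutual-information ranking and logistic regression are illustrative stand-ins here:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def select_by_ranking(X_tr, y_tr, X_val, y_val, ranking=None, patience=3):
    """Grow the feature subset down a ranked list; stop once the
    out-of-sample error keeps rising (overfitting onset).

    `ranking` stands in for the GMDH-derived ranking in the paper;
    mutual information is a fallback purely for illustration.
    """
    if ranking is None:
        ranking = np.argsort(mutual_info_classif(X_tr, y_tr))[::-1]
    best_err, best_k, worse = 1.0, 1, 0
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]
        clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
        err = 1 - accuracy_score(y_val, clf.predict(X_val[:, cols]))
        if err < best_err:
            best_err, best_k, worse = err, k, 0
        else:
            worse += 1
            if worse >= patience:   # error has kept rising: stop
                break
    return ranking[:best_k], best_err
```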

3.
Objective: This research is motivated by the issue of classifying illnesses of chronically ill patients for decision support in clinical settings. Our main objective is to propose multi-label classification of multivariate time series contained in medical records of chronically ill patients, by means of quantization methods, such as bag of words (BoW), and multi-label classification algorithms. Our second objective is to compare supervised dimensionality reduction techniques to state-of-the-art multi-label classification algorithms. The hypothesis is that kernel methods and locality preserving projections make such algorithms good candidates to study multi-label medical time series. Methods: We combine BoW and supervised dimensionality reduction algorithms to perform multi-label classification on health records of chronically ill patients. The considered algorithms are compared with state-of-the-art multi-label classifiers on two real-world datasets. The Portavita dataset contains 525 type 2 diabetes (DT2) patients, with co-morbidities of DT2 such as hypertension, dyslipidemia, and microvascular or macrovascular issues. The MIMIC II dataset contains 2635 patients affected by thyroid disease, diabetes mellitus, lipoid metabolism disease, fluid electrolyte disease, hypertensive disease, thrombosis, hypotension, chronic obstructive pulmonary disease (COPD), liver disease and kidney disease. The algorithms are evaluated using multi-label evaluation metrics such as Hamming loss, one-error, coverage, ranking loss, and average precision. Results: Non-linear dimensionality reduction approaches behave well on medical time series quantized using the BoW algorithm, with results comparable to state-of-the-art multi-label classification algorithms. Chaining the projected features has a positive impact on performance with respect to pure binary relevance approaches. Conclusions: The evaluation highlights the feasibility of representing medical health records using BoW for multi-label classification tasks. The study also highlights that dimensionality reduction algorithms based on kernel methods, locality preserving projections, or both are good candidates to deal with multi-label classification tasks in medical time series with many missing values and high label density.
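A hedged sketch of the BoW quantization step for one signal channel, using a k-means codebook over sliding windows; window length, overlap, and codebook size are illustrative choices, and the multivariate case would build one histogram per channel:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

def bow_features(series_list, win=24, codebook_size=64, seed=0):
    """Quantize variable-length time series into bag-of-words histograms.

    Each series (at least `win` samples long) is cut into half-overlapping
    windows; windows are clustered into `codebook_size` 'words'; each
    series then becomes a normalized word histogram.
    """
    windows = [s[i:i + win] for s in series_list
               for i in range(0, len(s) - win + 1, win // 2)]
    km = KMeans(n_clusters=codebook_size, n_init=10,
                random_state=seed).fit(np.asarray(windows))
    feats = []
    for s in series_list:
        w = [s[i:i + win] for i in range(0, len(s) - win + 1, win // 2)]
        words = km.predict(np.asarray(w))
        feats.append(np.bincount(words, minlength=codebook_size) / len(words))
    return np.asarray(feats)

def binary_relevance_baseline(X, Y):
    """Binary relevance over the BoW features (Y: n_samples x n_labels)."""
    return MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
```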

4.
This project was designed to harmonise the Royal College of Pathologists, College of American Pathologists and Royal College of Pathologists of Australasia datasets, checklists and structured reporting protocols for examination of radical prostatectomy specimens, with the aim of producing a common, internationally agreed, evidence-based dataset for prostate cancer reporting. The International Collaboration on Cancer Reporting prostate cancer expert review panel analysed the three existing datasets, identified concordant items and classified these data elements as 'required' (mandatory) or 'recommended' (non-mandatory), on the basis of the published literature up to August 2011. Required elements were defined as those that have agreed evidentiary support at NHMRC level III-2 or above. Consensus response values were formulated for each item. Twelve concordant pathology data elements were identified, and, on review, all but one were included as required elements for tumour staging, grading, or prediction of prognosis. There was minor discordance between the three existing datasets for another eight items, with two of these being added to the required dataset. Another 11 elements with a lesser level of evidentiary support were included in the recommended dataset. This process was found to be an efficient method for producing an evidence-based dataset for prostate cancer. Such internationally agreed datasets should facilitate meaningful comparison of benchmarking data, epidemiological studies, and clinical trials.

5.
Wang S, Yao J, Summers RM. Medical Physics 2008;35(4):1377-1386
Computer-aided detection (CAD) has been shown to be feasible for polyp detection on computed tomography (CT) scans. After initial detection, the dataset of colonic polyp candidates has large-scale and high-dimensional characteristics. In this article, we propose a nonlinear dimensionality reduction method based on diffusion map and locally linear embedding (DMLLE) for large-scale datasets. By selecting partial data as landmarks, we first map these points into a low-dimensional embedding space using the diffusion map. The embedded landmarks can be viewed as a skeleton of the whole data in the low-dimensional space. Then, using the locally linear embedding algorithm, non-landmark samples are mapped into the same low-dimensional space according to their nearest landmark samples. The local geometry is preserved in both the original high-dimensional space and the embedding space. In addition, DMLLE provides a faithful representation of the original high-dimensional data at coarse and fine scales. Thus, it can capture the intrinsic distance relationship between samples and reduce the influence of noisy features, two aspects that are crucial to achieving high classifier performance. We applied the proposed DMLLE method to a colonic polyp dataset of 175 269 polyp candidates with 155 features. Visual inspection shows that true polyps with similar shapes are mapped to close vicinity in the low-dimensional space. We compared the performance of a support vector machine (SVM) classifier in the low-dimensional embedding space with that in the original high-dimensional space, SVM with principal component analysis dimensionality reduction, and an SVM committee using feature selection technology. Free-response receiver operating characteristic analysis shows that by using our DMLLE dimensionality reduction method, SVM achieves higher sensitivity with a lower false positive rate compared with the other methods. For 6-9 mm polyps (193 true polyps in the test set), at 9 false positives per patient, SVM with DMLLE improves the average sensitivity from 70% to 83% compared with an SVM committee classifier, a state-of-the-art method for colonic polyp detection (p<0.001).
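A compact numpy sketch of the two DMLLE stages described above: a diffusion map embeds the landmark subset, and LLE-style reconstruction weights place the remaining samples relative to their nearest landmarks. The kernel bandwidth heuristic and neighborhood size are illustrative choices, not the paper's exact settings:

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(L, n_components=2, eps=None):
    """Stage 1: diffusion-map embedding of the landmark points L."""
    D2 = cdist(L, L, "sqeuclidean")
    eps = eps or np.median(D2)            # heuristic kernel bandwidth
    K = np.exp(-D2 / eps)
    P = K / K.sum(axis=1, keepdims=True)  # row-normalized Markov matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)[1:n_components + 1]  # skip trivial eigvec
    return vecs.real[:, order] * vals.real[order]

def lle_extend(X, L, Y_L, k=5):
    """Stage 2: map non-landmarks X into the landmark embedding Y_L via
    LLE reconstruction weights from the k nearest landmarks."""
    D = cdist(X, L)
    Y = np.zeros((len(X), Y_L.shape[1]))
    for i, x in enumerate(X):
        nn = np.argsort(D[i])[:k]
        G = (L[nn] - x) @ (L[nn] - x).T           # local Gram matrix
        w = np.linalg.solve(G + 1e-6 * np.eye(k), np.ones(k))
        w /= w.sum()                              # reconstruction weights
        Y[i] = w @ Y_L[nn]
    return Y
```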

6.
We introduce a distance (similarity)-based mapping for the visualization of high-dimensional patterns and their relative relationships. The mapping preserves exactly the original distances between points with respect to any two reference patterns in a special two-dimensional coordinate system, the relative distance plane (RDP). As only a single calculation of a distance matrix is required, this method is computationally efficient, an essential requirement for any exploratory data analysis. The data visualization afforded by this representation permits a rapid assessment of class pattern distributions. In particular, we can determine with a simple statistical test whether both training and validation sets of a 2-class, high-dimensional dataset derive from the same class distributions. We can explore any dataset in detail by identifying the subset of reference pairs whose members belong to different classes, cycling through this subset, and for each pair, mapping the remaining patterns. These multiple viewpoints facilitate the identification and confirmation of outliers. We demonstrate the effectiveness of this method on several complex biomedical datasets. Because of its efficiency, effectiveness, and versatility, one may use the RDP representation as an initial, data mining exploration that precedes classification by some classifier. Once final enhancements to the RDP mapping software are completed, we plan to make it freely available to researchers.
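The RDP coordinates are fully determined by the three pairwise distances involved, so the mapping can be reconstructed directly (a sketch, assuming Euclidean distance; only the sign of the second coordinate is lost, consistent with a plane built from two references):

```python
import numpy as np

def rdp_coordinates(X, r1, r2):
    """Relative distance plane: place reference r1 at the origin and r2
    on the x-axis; each pattern keeps its exact distances to both refs."""
    d12 = np.linalg.norm(r2 - r1)
    d1 = np.linalg.norm(X - r1, axis=1)
    d2 = np.linalg.norm(X - r2, axis=1)
    x = (d1**2 + d12**2 - d2**2) / (2 * d12)   # law of cosines
    y = np.sqrt(np.maximum(d1**2 - x**2, 0))   # fold everything to y >= 0
    return np.column_stack([x, y])
```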

7.
In microarray data analysis, each gene expression sample has thousands of genes, and reducing such high dimensionality is useful for both visualization and subsequent clustering of samples. Traditional principal component analysis (PCA) is a commonly used method, but its components can take negative values, which are hard to interpret for inherently non-negative expression data. Nonnegative matrix factorization (NMF) is a newer dimension reduction method. In this paper we compare NMF and PCA for dimension reduction; the reduced data are used for visualization and for clustering analysis via k-means on 11 real gene expression datasets. The results on one leukemia dataset show that NMF can discover natural clusters and clearly detect one mislabeled sample while PCA cannot. For clustering analysis via k-means, NMF typically outperforms PCA. Our results demonstrate the superiority of NMF over PCA in reducing microarray data.
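A sketch of the comparison pipeline with scikit-learn; the adjusted Rand index is our stand-in choice for scoring cluster agreement with the known sample labels:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, PCA
from sklearn.metrics import adjusted_rand_score

def compare_reductions(X, y_true, k):
    """Reduce expression matrix X (samples x genes, non-negative) to k
    dimensions with NMF and PCA, cluster with k-means, and score
    agreement of the clusters with the known labels."""
    scores = {}
    for name, Z in {
        "NMF": NMF(n_components=k, init="nndsvda",
                   max_iter=500, random_state=0).fit_transform(X),
        "PCA": PCA(n_components=k).fit_transform(X),
    }.items():
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
        scores[name] = adjusted_rand_score(y_true, labels)
    return scores  # higher ARI = clusters closer to the true classes
```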

8.
Recent advances in clinical proteomics data acquisition have led to the generation of datasets of high complexity and dimensionality. We present here a visualization method for high-dimensional datasets that makes use of the neuronal vectors of a trained growing cell structure (GCS) network to project data points onto two dimensions. The use of a GCS network enables the projection matrix to be generated deterministically rather than randomly as in random projection. Three datasets were used to benchmark performance and to demonstrate the use of this deterministic projection approach in real-life scientific applications. Comparisons are made to an existing self-organizing map projection method and to random projection. The results suggest that deterministic projection outperforms the existing methods and is suitable for the visualization of datasets of very high dimensionality.
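Training a GCS network is beyond a short sketch, but the projection step the abstract describes, with learned neuron weight vectors serving as a deterministic projection matrix in place of a random one, might look like this (our reading of the method, not the authors' code):

```python
import numpy as np

def deterministic_projection(X, neuron_vectors):
    """Project high-dimensional points onto the (unit-normalized)
    neuronal weight vectors of a trained GCS network, instead of the
    random matrix used in random projection. Training the GCS itself
    is not shown; any two learned d-dimensional vectors work here."""
    W = np.asarray(neuron_vectors, dtype=float)   # shape (2, d), assumed
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    return X @ W.T                                # (n, 2) plot coordinates
```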

9.
Two image datasets (one thick-section dataset and one volumetric dataset) were typically reconstructed from each single CT projection dataset. The volumetric dataset was stored in a mini-PACS with 271 GB online and 680 GB nearline storage and routed to radiologists' workstations, whereas the thick-section dataset was stored in the main PACS. Over a 5-month sample period, 278 GB of CT data (8976 examinations) was stored in the main PACS, and 738 GB of volumetric datasets (6193 examinations) was stored in the mini-PACS. The volumetric datasets formed 32.8% of the total data for all modalities (2.20 TB) in the main PACS and mini-PACS combined. At the end of this period, the volumetric datasets of 1892 and 5162 examinations were kept online and nearline, respectively. A mini-PACS offers an effective method of archiving every volumetric dataset and delivering it to radiologists.

10.
Marker gene selection has been an important research topic in the classification analysis of gene expression data. Current methods try to reduce the “curse of dimensionality” by using statistical intra-feature-set calculations or classifiers that are based on the given dataset. In this paper, we present SoFoCles, an interactive tool that enables semantic feature filtering in microarray classification problems using external, well-defined knowledge retrieved from the Gene Ontology. The notion of semantic similarity is used to derive genes that are involved in the same biological path during the microarray experiment, by enriching a feature set that has been initially produced with legacy methods. Among its other functionalities, SoFoCles offers a large repository of semantic similarity methods that are used to derive feature sets and marker genes. The structure and functionality of the tool are discussed in detail, as well as its ability to improve classification accuracy. Through experimental evaluation, SoFoCles is shown to outperform other classification schemes in terms of classification accuracy on two real datasets using different semantic similarity computation approaches.
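A sketch of the semantic enrichment idea as we read it: starting from a legacy-selected gene set, add genes whose Gene Ontology semantic similarity to the set exceeds a threshold. The similarity matrix S and the threshold are assumed inputs; computing S is the job of the tool's repository of similarity methods:

```python
import numpy as np

def enrich_feature_set(seed_idx, S, thresh=0.8):
    """Enrich an initial (legacy-method) gene set with semantically
    similar genes. S is a precomputed gene-gene semantic similarity
    matrix (e.g. from a GO-based measure); `thresh` is illustrative."""
    sim_to_seed = S[:, seed_idx].max(axis=1)   # best similarity to any seed
    enriched = np.where(sim_to_seed >= thresh)[0]
    return np.union1d(seed_idx, enriched)      # indices of the final set
```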

11.
Microarray data is a key source of experimental data for modelling gene regulatory interactions from expression levels. With the rapid increase of publicly available microarray data comes the opportunity to produce regulatory network models based on multiple datasets. Such models are potentially more robust with greater confidence, and place less reliance on a single dataset. However, combining datasets directly can be difficult as experiments are often conducted on different microarray platforms, and in different laboratories leading to inherent biases in the data that are not always removed through pre-processing such as normalisation. In this paper we compare two frameworks for combining microarray datasets to model regulatory networks: pre- and post-learning aggregation. In pre-learning approaches, such as using simple scale-normalisation prior to the concatenation of datasets, a model is learnt from a combined dataset, whilst in post-learning aggregation individual models are learnt from each dataset and the models are combined. We present two novel approaches for post-learning aggregation, each based on aggregating high-level features of Bayesian network models that have been generated from different microarray expression datasets. Meta-analysis Bayesian networks are based on combining statistical confidences attached to network edges whilst Consensus Bayesian networks identify consistent network features across all datasets. We apply both approaches to multiple datasets from synthetic and real (Escherichia coli and yeast) networks and demonstrate that both methods can improve on networks learnt from a single dataset or an aggregated dataset formed using a standard scale-normalisation.
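A toy sketch of the two post-learning aggregation styles, given each dataset's edge-confidence map (e.g. bootstrap frequencies); the paper's exact statistics differ, and this only illustrates the contrast between meta-analysis and consensus:

```python
from collections import defaultdict

def aggregate_networks(edge_conf_per_dataset, conf_thresh=0.5):
    """Post-learning aggregation of per-dataset Bayesian networks.

    Two variants in the spirit of the paper:
      meta:      average edge confidence across datasets
      consensus: edges passing the threshold in every dataset
    """
    n = len(edge_conf_per_dataset)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for conf in edge_conf_per_dataset:
        for edge, c in conf.items():
            totals[edge] += c
            counts[edge] += c >= conf_thresh
    meta = {e: t / n for e, t in totals.items()}
    consensus = {e for e, k in counts.items() if k == n}
    return meta, consensus
```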

12.
Automatic detection of EEG signals and seizure diagnosis in epilepsy patients are of great significance for clinical treatment. To address the difficulties of limited training data and mismatched training/test data distributions, a cross-domain joint knowledge transfer learning method is adopted to recognize epileptic states from small training sets. First, a 4-level wavelet packet decomposition is applied to the EEG signals and the wavelet packet coefficients are extracted as features. Knowledge transfer between source-domain and target-domain features is accomplished by iteratively adjusting the marginal and joint distributions, and a dilated convolutional neural network is trained as the classifier to recognize epileptic states in the target domain. The algorithm was validated on the CHB-MIT scalp EEG dataset from Boston Children's Hospital (22 subjects, 790 h in total) and the University of Bonn epilepsy EEG dataset (5 sets, 100 segments each, 23.6 s per segment). Experimental results show that the proposed method achieves average recognition accuracy, sensitivity, and specificity of 96.8%, 96.1%, and 96.4% for complex epileptic states on the CHB-MIT dataset; on the Bonn dataset, the average recognition accuracy is 96.9%. The method effectively improves the overall performance of epileptic state recognition and enables stable, reliable seizure detection.
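A sketch of the feature-extraction step with PyWavelets: a 4-level wavelet packet decomposition per EEG channel, with the leaf-node coefficients concatenated as features. The wavelet family is an illustrative choice; the transfer-learning and dilated-CNN stages are not reproduced here:

```python
import numpy as np
import pywt

def wavelet_packet_features(signal, wavelet="db4", level=4):
    """4-level wavelet packet decomposition of one EEG channel; the
    coefficients of all 2**level leaf nodes form the feature vector."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="freq")       # 16 leaf nodes
    return np.hstack([node.data for node in nodes])  # coefficients as features

# e.g. features = np.hstack([wavelet_packet_features(ch) for ch in eeg_channels])
```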

13.
Detecting epistatic interactions is a typical way of identifying the genetic susceptibility of complex diseases. Multifactor dimensionality reduction (MDR) is a decent solution for epistasis detection, but existing MDR-based methods still suffer from high computational costs or poor performance. In this paper, we propose a new solution that integrates a dual screening strategy with MDR, termed DualWMDR. Particularly, the first screening employs an adaptive clustering algorithm with part mutual information (PMI) to group single nucleotide polymorphisms (SNPs) and exclude noisy SNPs; the second screening takes into account both the single-locus effect and the interaction effect to select dominant SNPs, which effectively alleviates the negative impact of main effects and provides a much smaller but accurate candidate set for MDR. After that, MDR uses weighted classification evaluation to improve its performance in epistasis identification on the candidate set. The results on diverse simulation datasets show that DualWMDR outperforms existing competitive methods, and the results on three real genome-wide datasets: the age-related macular degeneration (AMD) dataset, and the breast cancer (BC) and celiac disease (CD) datasets from the Wellcome Trust Case Control Consortium, again corroborate the effectiveness of DualWMDR.
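A minimal sketch of the core MDR step on an already-screened candidate set (genotypes coded 0/1/2; the dual screening and weighted evaluation of DualWMDR are not reproduced): each two-locus genotype cell is called high-risk when its case:control ratio exceeds the overall ratio, and the resulting binary rule is scored:

```python
from itertools import combinations
import numpy as np

def mdr_pair_accuracy(G, y, i, j):
    """Core MDR step for one SNP pair (i, j): pool the two loci into
    genotype cells, label each cell high/low risk by its case:control
    ratio, and score the resulting binary rule on labels y (0/1)."""
    cells = G[:, i] * 3 + G[:, j]           # 9 two-locus genotype cells
    pred = np.zeros(len(y), dtype=int)
    base = y.mean() / (1 - y.mean())        # overall case:control ratio
    for c in np.unique(cells):
        m = cells == c
        cases = y[m].sum()
        ctrls = (~y[m].astype(bool)).sum()
        if ctrls == 0 or cases / ctrls > base:
            pred[m] = 1                     # high-risk cell
    return (pred == y).mean(), (i, j)

def mdr_scan(G, y, candidates):
    """Exhaustive MDR over the screened candidate SNPs (supplied by the
    dual screening in the paper); returns the best pair and its score."""
    return max(mdr_pair_accuracy(G, y, i, j)
               for i, j in combinations(candidates, 2))
```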

14.
Objective: The objective of this study is to help a team of physicians and knowledge engineers acquire clinical knowledge from existing practices datasets for treatment of head and neck cancer, to validate the knowledge against published guidelines, to create refined rules, and to incorporate these rules into clinical workflow for clinical decision support. Methods and materials: A team of physicians (clinical domain experts) and knowledge engineers adapt an approach for modeling existing treatment practices into final executable clinical models. For initial work, the oral cavity is selected as the candidate target area for the creation of rules covering a treatment plan for cancer. The final executable model is presented in HL7 Arden Syntax, which helps the clinical knowledge be shared among organizations. We use a data-driven knowledge acquisition approach based on analysis of real patient datasets to generate a predictive model (PM). The PM is converted into a refined clinical knowledge model (R-CKM), which follows a rigorous validation process. The validation process uses a clinical knowledge model (CKM), which provides the basis for defining the underlying validation criteria. The R-CKM is converted into a set of medical logic modules (MLMs) and is evaluated using real patient data from a hospital information system. Results: We selected the oral cavity as the intended site for derivation of all related clinical rules for possible associated treatment plans. A team of physicians analyzed the National Comprehensive Cancer Network (NCCN) guidelines for the oral cavity and created a common CKM. Among the decision tree algorithms, chi-squared automatic interaction detection (CHAID) was applied to a refined dataset of 1229 patients to generate the PM. The PM was tested on a disjoint dataset of 739 patients, giving 59.0% accuracy. Using a rigorous validation process, the R-CKM was created from the PM as the final model, after conforming to the CKM. The R-CKM was converted into four candidate MLMs and used to evaluate real data from 739 patients, yielding efficient performance with 53.0% accuracy. Conclusion: Data-driven knowledge acquisition and validation against published guidelines were used to help a team of physicians and knowledge engineers create executable clinical knowledge. The advantages of the R-CKM are twofold: it reflects real practices and conforms to standard guidelines, while providing optimal accuracy comparable to that of a PM. The proposed approach yields better insight into the steps of knowledge acquisition and enhances the collaboration efforts of the team of physicians and knowledge engineers.
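scikit-learn has no CHAID implementation, so a CART tree stands in below purely to illustrate the PM train/validate flow on disjoint patient sets; the leaf-size setting is an assumption:

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def build_predictive_model(X, y):
    """Sketch of the PM step only: fit a decision tree on one patient
    subset and measure accuracy on a disjoint subset (the paper uses
    CHAID and reports 59.0% on its disjoint test set)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=0)
    pm = DecisionTreeClassifier(min_samples_leaf=30).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, pm.predict(X_te))
    return pm, acc
```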

15.
A relabeling algorithm for retrieval of noisy instances with binary outcomes is presented. The relabeling algorithm iteratively retrieves, selects, and re-labels data instances (i.e., transforms a decision space) to improve prediction quality. It emphasizes knowledge generalization and confidence rather than classification accuracy. A confidence index incorporating classification accuracy, prediction error, impurities in the relabeled dataset, and cluster purities was designed. The proposed approach is illustrated with a binary outcome dataset and was successfully tested on four standard benchmark UCI repository datasets as well as on bladder cancer immunotherapy data. A subset of the most stable instances (7% to 51% of the sample) with high confidence (between 64% and 99.44%) was identified for each application, along with the most noisy instances. Domain experts and the extracted knowledge validated the relabeled instances and the corresponding confidence indexes. The relabeling algorithm, with some modifications, can be applied to other medical, industrial, and service domains.
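A hedged sketch of the iterative relabeling loop; the paper's confidence index combines several terms, whereas out-of-bag predicted probability is used here for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def iterative_relabel(X, y, max_iter=10, flip_thresh=0.9):
    """Train, find instances the model contradicts with high confidence,
    flip their labels, repeat until the labeling stabilizes.

    Labels y must be 0/1; `flip_thresh` is an illustrative cutoff."""
    y = np.asarray(y).copy()
    clf = None
    for _ in range(max_iter):
        clf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                     random_state=0).fit(X, y)
        proba = clf.oob_decision_function_[:, 1]   # out-of-bag P(class 1)
        flips = ((proba > flip_thresh) & (y == 0)) | \
                ((proba < 1 - flip_thresh) & (y == 1))
        if not flips.any():
            break                                   # decision space stable
        y[flips] = 1 - y[flips]
    return y, clf
```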

16.
17.
[Abstract] Neuroimaging techniques are widely used to study the correlation between structural and functional brain abnormalities and neuropsychiatric disorders. Unlike traditional statistical analysis, machine learning models can make individualized predictions from neuroimaging data and uncover potential biomarkers. Computer-aided diagnosis of neuropsychiatric disorders comprises data preprocessing and machine learning algorithms. Data preprocessing is a form of manual feature engineering that supplies quantitative features to the machine learning algorithms; the algorithms themselves comprise feature dimensionality reduction, model training, and model evaluation. A robust machine learning algorithm can predict accurately across different datasets and identify the features that contribute most to its predictions as candidate biomarkers. This article reviews recent progress in machine learning-based computer-aided diagnosis of neuropsychiatric disorders from three perspectives (data preprocessing, machine learning algorithms, and biomarkers) and discusses future research directions. Keywords: neuropsychiatric disorders; neuroimaging; machine learning; computer-aided diagnosis

18.
The Sparse Manifold Clustering and Embedding (SMCE) algorithm was recently proposed for simultaneous clustering and dimensionality reduction of data on nonlinear manifolds using sparse representation techniques. In this work, the SMCE algorithm is applied to the differential discrimination of glioblastoma and meningioma tumors by means of their gene expression profiles. Our purpose was to evaluate the robustness of this nonlinear manifold method for classifying gene expression profiles, which are characterized by the high dimensionality of their representations and the low discrimination power of most genes. To this end, we used SMCE to reduce the dimensionality of a preprocessed dataset of 35 single-labeling cDNA microarrays with 11500 original clones. Afterwards, supervised and unsupervised methodologies were applied to obtain the classification model: the former was based on linear discriminant analysis, the latter on clustering using the SMCE embedding data. The results obtained with both approaches showed that all (100%) of the samples could be correctly classified, and the results of all repetitions but one formed a compatible cluster of predictive labels. Finally, the low-dimensional embedding extracted by SMCE revealed large discrimination margins between the two classes.
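SMCE itself involves a sparse optimization program that we do not reproduce; the supervised branch on an already-computed embedding Z, with leave-one-out validation suited to the 35-sample dataset, can be sketched as:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def classify_embedding(Z, y):
    """Score the supervised branch (LDA) on an SMCE embedding Z; the
    embedding is assumed computed elsewhere. Returns LOO accuracy."""
    return cross_val_score(LinearDiscriminantAnalysis(), Z, y,
                           cv=LeaveOneOut()).mean()
```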

19.
The curve of left ventricular (LV) volume change throughout the cardiac cycle is a fundamental parameter for the clinical evaluation of various cardiovascular diseases. Currently, this evaluation is often performed manually, which is tedious and time-consuming and suffers from significant interobserver and intraobserver variability. This paper introduces a new automatic method, based on nonlinear dimensionality reduction (NLDR), for extracting the curve of LV volume change over a cardiac cycle from two-dimensional (2-D) echocardiography images. Isometric feature mapping (Isomap) is one of the most popular NLDR algorithms. In this study, a modified version of the Isomap algorithm, in which the image-to-image distance metric is computed using nonrigid registration, is applied to 2-D echocardiography images covering one cardiac cycle. Using this approach, the nonlinear information in these images is embedded in a 2-D manifold and each image is characterized by a symbol on the constructed manifold. This new representation visualizes the relationship between the images in terms of LV volume change and allows the curve of LV volume change to be extracted automatically. In contrast to traditional segmentation algorithms, our method needs no LV myocardial segmentation and tracking, which are particularly difficult in echocardiography images. Moreover, it does not require a large training set covering various diseases. The results obtained by our method are quantitatively compared with those obtained manually by a highly experienced echocardiographer on ten healthy volunteers and six patients, demonstrating the usefulness of the presented method.
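The nonrigid-registration distances are the expensive part and are assumed precomputed here; given that pairwise matrix, the manifold step maps directly onto scikit-learn's Isomap with a precomputed metric (the neighborhood size is an illustrative choice):

```python
from sklearn.manifold import Isomap

def embed_cardiac_cycle(D):
    """Embed one cardiac cycle of echo frames, given D, the pairwise
    image-to-image distance matrix from nonrigid registration
    (registration itself is outside this sketch)."""
    iso = Isomap(n_components=2, n_neighbors=5, metric="precomputed")
    return iso.fit_transform(D)   # one 2-D symbol per frame

# The trajectory of the embedded frames over the cycle tracks LV volume
# change, from which the volume-change curve can be read off.
```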

20.
Motivation: Cancer hallmark annotation is a promising technique that could discover novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as a complementary approach that can retrieve knowledge from massive text information, advancing numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation imposes a unique challenge. Results: To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights, a novel approach, UDT-RF, which makes use of ontological features, is proposed. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and utilizes novel feature selections for elucidating the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, state-of-the-art methods are compared and evaluated by a multitude of performance metrics, revealing the full performance spectrum on the full set of cancer hallmarks. Several case studies are conducted, demonstrating how the proposed approach could reveal novel insights into cancers. Availability: https://github.com/cskyan/chmannot
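A sketch of the ontological feature expansion plus a binary-relevance random forest; the `ancestors` map (term to its MeSH ancestors) is an assumed input built elsewhere from the MeSH graph, and plain random forests stand in for the UDT-RF specifics:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

def expand_with_ancestors(doc_terms, ancestors):
    """Expand each document's MeSH terms with their ontology ancestors
    (`ancestors`: term -> set of ancestor terms). This mimics the
    feature-space expansion via the MeSH ontology graph."""
    return [set(t for term in terms
                for t in {term} | ancestors.get(term, set()))
            for terms in doc_terms]

def hallmark_classifier(expanded, vocab, Y):
    """Binary relevance over the expanded term space; Y is the
    documents-by-hallmarks binary label matrix."""
    idx = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(expanded), len(vocab)))
    for d, terms in enumerate(expanded):
        for t in terms & idx.keys():
            X[d, idx[t]] = 1.0
    return MultiOutputClassifier(
        RandomForestClassifier(n_estimators=300, random_state=0)).fit(X, Y)
```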
