Similar literature
 20 similar articles found
1.
Gene selection is an important task in bioinformatics studies, because the accuracy of cancer classification generally depends upon the genes that have biological relevance to the classification problem. In this work, a randomization test (RT) is used as a gene selection method for gene expression data. In the method, a statistic derived from the statistics of the regression coefficients in a series of partial least squares discriminant analysis (PLSDA) models is used to evaluate the significance of the genes. Informative genes are selected for classifying four gene expression datasets (prostate cancer, lung cancer, leukemia and non-small cell lung cancer (NSCLC)), and the rationality of the results is validated by multiple linear regression (MLR) modeling and principal component analysis (PCA). With the selected genes, satisfactory results can be obtained.
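
The randomization idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the statistic here is the absolute difference in class means for one gene, swapped in for the PLSDA regression-coefficient statistic the abstract describes. The function name and the expression values are hypothetical.

```python
import random

def randomization_p_value(values, labels, n_perm=2000, seed=0):
    """Permutation p-value for the absolute difference in class means of one gene.

    Assumption: this simple statistic stands in for the PLSDA-based statistic.
    """
    def stat(lbls):
        a = [v for v, l in zip(values, lbls) if l == 1]
        b = [v for v, l in zip(values, lbls) if l == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    rng = random.Random(seed)
    observed = stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break any link between gene and class
        if stat(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: one gene that separates the classes, one that does not.
labels = [1] * 6 + [0] * 6
informative = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 1.0, 1.2, 0.9, 1.1, 0.8, 1.3]
noise = [2.0, 3.1, 2.4, 2.9, 2.2, 3.0, 2.1, 3.2, 2.3, 2.8, 2.5, 3.0]
p_inf = randomization_p_value(informative, labels)
p_noise = randomization_p_value(noise, labels)
```

A gene would be kept when its p-value falls below a chosen significance level; that level is a free parameter, not something the abstract specifies.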

2.

Objective

Medical data sets are usually small and have very high dimensionality. Too many attributes will make the analysis less efficient and will not necessarily increase accuracy, while too few data will decrease the modeling stability. Consequently, the main objective of this study is to extract the optimal subset of features to increase analytical performance when the data set is small.

Methods

This paper proposes a fuzzy-based non-linear transformation method to extend classification related information from the original data attribute values for a small data set. Based on the new transformed data set, this study applies principal component analysis (PCA) to extract the optimal subset of features. Finally, we use the transformed data with these optimal features as the input data for a learning tool, a support vector machine (SVM). Six medical data sets: Pima Indians’ diabetes, Wisconsin diagnostic breast cancer, Parkinson disease, echocardiogram, BUPA liver disorders dataset, and bladder cancer cases in Taiwan, are employed to illustrate the approach presented in this paper.

Results

This research uses the t-test to evaluate the classification accuracy for a single data set, and uses the Friedman test to show that the proposed method is better than other methods over multiple data sets. The experimental results indicate that the proposed method has better classification performance than either PCA or kernel principal component analysis (KPCA) when the data set is small, and suggest creating new purpose-related information to improve the analysis performance.

Conclusion

This paper has shown that feature extraction is important as a function of feature selection for efficient data analysis. When the data set is small, using the fuzzy-based transformation method presented in this work to increase the information available produces better results than the PCA and KPCA approaches.

3.
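Alzheimer's disease (AD) is a neurodegenerative disease with an insidious onset and a progressive course. Using magnetic resonance imaging and computer technology to assist in diagnosing AD patients is a new topic under continuing exploration. This study first preprocesses the magnetic resonance images and performs a correlation analysis, then uses kernel principal component analysis (KPCA) to extract features from brain gray matter images and classifies them with the Adaboost algorithm, with principal component analysis (PCA) used as a comparison. A study of brain functional magnetic resonance images from 116 AD patients, 116 patients with mild cognitive impairment, and 117 normal controls in the Alzheimer's Disease Neuroimaging Initiative database shows that machine learning can effectively assist the diagnosis of AD. KPCA extracts image features more fully and completely than PCA, gives more accurate classification results, and yields better AD-assisted diagnosis.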

4.
Classification into multiple classes when the measured variables are outnumbered is a major methodological challenge in -omics studies. Two algorithms that overcome the dimensionality problem are presented: the forest classification tree (FCT) and the forest support vector machines (FSVM). In FCT, a set of variables is randomly chosen and a classification tree (CT) is grown using a forward classification algorithm. The process is repeated and a forest of CTs is derived. Finally, the most frequent variables from the trees with the smallest apparent misclassification rate (AMR) are used to construct a productive tree. In FSVM, the CTs are replaced by SVMs. The methods are demonstrated using prostate gene expression data for classifying tissue samples into four tumor types. For threshold split value 0.001 and utilizing 100 markers the productive CT consisted of 29 terminal nodes and achieved perfect classification (AMR=0). When the threshold value was set to 0.01, a tree with 17 terminal nodes was constructed based on 15 markers (AMR=7%). In FSVM, reducing the fraction of the forest that was used to construct the best classifier from the top 80% to the top 20% reduced the misclassification to 25% (when using 200 markers). The proposed methodologies may be used for identifying important variables in high dimensional data. Furthermore, the FCT allows exploring the data structure and provides a decision rule.

5.
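Objective

Gene chip technology has had a revolutionary impact on the development of clinical diagnosis, treatment, and drug development and screening. Because high-dimensional medical data are difficult to reduce, and gene expression profile data have few samples, high dimensionality, and heavy noise, dimensionality reduction is essential. Methods based on principal component analysis (PCA) and linear discriminant analysis (LDA) effectively solve the gene expression profile classification problem and improve the recognition rate.

Methods

PCA and LDA were each introduced to reduce the dimensionality of gene expression profile data, and K-nearest neighbor (KNN) was then used as the classifier. The approach was applied to breast cancer and ovarian cancer mass spectrometry data.

Results

On the mass spectrometry data for the two cancers, PCA and LDA effectively extracted discriminative feature information and greatly reduced the dimensionality of the medical data while maintaining a high classification accuracy.

Conclusion

Classifying cancer gene expression profile data with dimensionality reduction methods can help clinicians discover new disease characteristics and improve the accuracy of disease diagnosis.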

6.
Optical spectroscopy has shown potential as a real-time, in vivo, diagnostic tool for identifying neoplasia during endoscopy. We present the development of a diagnostic algorithm to classify elastic-scattering spectroscopy (ESS) spectra as either neoplastic or non-neoplastic. The algorithm is based on pattern recognition methods, including ensemble classifiers, in which members of the ensemble are trained on different regions of the ESS spectrum, and misclassification-rejection, where the algorithm identifies and refrains from classifying samples that are at higher risk of being misclassified. These "rejected" samples can be reexamined by simply repositioning the probe to obtain additional optical readings or ultimately by sending the polyp for histopathological assessment, as per standard practice. Prospective validation using separate training and testing sets results in a baseline performance of sensitivity = 0.83, specificity = 0.79, using the standard framework of feature extraction (principal component analysis) followed by classification (with linear support vector machines). With the developed algorithm, performance improves to Se ~ 0.90, Sp ~ 0.90, at a cost of rejecting 20-33% of the samples. These results are on par with a panel of expert pathologists. For colonoscopic prevention of colorectal cancer, our system could reduce biopsy risk and cost, obviate retrieval of non-neoplastic polyps, decrease procedure time, and improve assessment of cancer risk.
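
The misclassification-rejection step can be pictured with a small sketch. It assumes the classifier emits a probability-like score and that samples close to the decision boundary are the ones rejected; the abstract does not say how the risk of misclassification is actually estimated. The scores, labels, and the width of the rejection band are hypothetical.

```python
def classify_with_rejection(prob_neoplastic, reject_band=0.2):
    """Return 'neoplastic', 'non-neoplastic', or 'rejected' for one spectrum.

    prob_neoplastic: classifier score in [0, 1] (hypothetical).
    reject_band: half-width of the uncertain zone around 0.5 (assumed value).
    """
    if abs(prob_neoplastic - 0.5) < reject_band:
        return "rejected"
    return "neoplastic" if prob_neoplastic > 0.5 else "non-neoplastic"

def sensitivity_specificity(decisions, truths):
    """Se and Sp over accepted samples only (assumes both classes remain)."""
    kept = [(d, t) for d, t in zip(decisions, truths) if d != "rejected"]
    tp = sum(d == "neoplastic" and t == "neoplastic" for d, t in kept)
    tn = sum(d == "non-neoplastic" and t == "non-neoplastic" for d, t in kept)
    pos = sum(t == "neoplastic" for _, t in kept)
    neg = len(kept) - pos
    return tp / pos, tn / neg

# Hypothetical scores and ground truth for ten spectra.
scores = [0.95, 0.85, 0.60, 0.40, 0.10, 0.20, 0.55, 0.45, 0.90, 0.05]
truths = ["neoplastic"] * 4 + ["non-neoplastic"] * 4 + ["neoplastic", "non-neoplastic"]

baseline = sensitivity_specificity(
    [classify_with_rejection(s, reject_band=0.0) for s in scores], truths)
decisions = [classify_with_rejection(s, reject_band=0.2) for s in scores]
with_rejection = sensitivity_specificity(decisions, truths)
rejected_fraction = decisions.count("rejected") / len(decisions)
```

In this toy data the borderline scores are exactly the ones the baseline gets wrong, so rejecting them raises both measures at the cost of leaving some samples unclassified, which is the trade-off the abstract reports.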

7.
Classification of gene expression data plays a significant role in prediction and diagnosis of diseases. Gene expression data has a special characteristic: there is a mismatch between the gene dimension and the sample dimension. Not all genes contribute to efficient classification of samples. A robust feature selection algorithm is required to identify the important genes which help in classifying the samples efficiently. In order to select informative genes (features) based on relevance and redundancy characteristics, many feature selection algorithms have been introduced in the past. Most of the earlier algorithms require a computationally expensive search strategy to find an optimal feature subset. Existing feature selection methods are also sensitive to the evaluation measures. The paper introduces a novel and efficient feature selection approach based on a statistically defined effective range of features for every class, termed ERGS (Effective Range based Gene Selection). The basic principle behind ERGS is that higher weight is given to the feature that discriminates the classes clearly. Experimental results on well-known gene expression datasets illustrate the effectiveness of the proposed approach. Two popular classifiers, viz. the Naïve Bayes Classifier (NBC) and the Support Vector Machine (SVM), have been used for classification. The proposed feature selection algorithm can be helpful in ranking the genes and is also capable of identifying the most relevant genes responsible for diseases like leukemia, colon tumor, lung cancer, diffuse large B-cell lymphoma (DLBCL), and prostate cancer.
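
The principle that a gene whose class ranges overlap less should get a higher weight can be sketched as follows. This is a simplified illustration, not the published ERGS formula: the range is assumed to be the class mean plus or minus k standard deviations, only two classes are handled, and the weight is one minus the overlap fraction. The gene names and values are hypothetical.

```python
from statistics import mean, pstdev

def effective_range(values, k=1.0):
    """Assumed range for one class: mean +/- k * standard deviation."""
    m, s = mean(values), pstdev(values)
    return (m - k * s, m + k * s)

def range_weight(class_a, class_b, k=1.0):
    """Weight in [0, 1]; 1.0 means the two class ranges do not overlap."""
    lo_a, hi_a = effective_range(class_a, k)
    lo_b, hi_b = effective_range(class_b, k)
    overlap = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    span = max(hi_a, hi_b) - min(lo_a, lo_b)
    if span == 0:
        return 0.0  # both classes are the same constant: no discrimination
    return 1.0 - overlap / span

# Hypothetical expression values per gene: (class A samples, class B samples).
genes = {
    "gene_A": ([1.0, 1.2, 0.8, 1.1], [5.0, 5.3, 4.9, 5.1]),
    "gene_B": ([2.0, 3.0, 2.5, 2.8], [2.2, 2.9, 2.6, 2.7]),
}
weights = {name: range_weight(a, b) for name, (a, b) in genes.items()}
ranked = sorted(weights, key=weights.get, reverse=True)
```

Because each gene is scored independently from class statistics, a ranking of this kind needs no search over feature subsets, which matches the efficiency argument in the abstract.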

8.
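Application of the relevance vector machine to tumor expression profile classification   Total citations: 1, self-citations: 0, citations by others: 1
Gene chip technology can measure the expression levels of large numbers of genes and is increasingly widely used in tumor research. Tumor classification and diagnosis based on gene chip expression profiles is a hot topic in tumor expression profile research. Tumor expression profile classification is a typical high-dimensional, small-sample classification problem, and this paper describes a two-step classification method for it. The measured gene expression profiles contain many redundant genes that are not differentially expressed, so an effective gene preselection strategy is first used to obtain a smaller candidate gene subset, and a classification prediction model based on the relevance vector machine is then built. On four real tumor expression profile data sets, comparison with several other methods shows that this method achieves better classification accuracy and also very good stability.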

9.
The very high dimensional space of gene expression measurements obtained by DNA microarrays impedes the detection of underlying patterns in gene expression data and the identification of discriminatory genes. In this paper we show the use of projection methods such as principal components analysis (PCA) to obtain a direct link between patterns in the genes and patterns in samples. This feature is useful in the initial interactive pattern exploration of gene expression data and data-driven learning of the nature and types of samples. Using oligonucleotide microarray measurements of 40 samples from different normal human tissues, we show that distinct patterns are obtained when the genes are projected on a two-dimensional plane spanned by the loadings of the two major principal components. These patterns define the particular genes associated with a sample class (i.e., tissue). When used separately from the other genes, these class-specific (i.e., tissue-specific) genes in turn define distinct tissue patterns in the projection space spanned by the scores of the two major principal components. In this study, PCA projection facilitated discriminatory gene selection for different tissues and identified tissue-specific gene expression signatures for liver, skeletal muscle, and brain samples. Furthermore, it allowed the classification of nine new samples belonging to these three types using the linear combination of the expression levels of the tissue-specific genes determined from the first set of samples. The application of the technique to other published data sets is also discussed.
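
The link between loadings (patterns in genes) and scores (patterns in samples) can be sketched with a singular value decomposition. This is a generic PCA sketch using NumPy, not the authors' pipeline; the tiny expression matrix and the function name are hypothetical.

```python
import numpy as np

def pca_scores_loadings(X, n_components=2):
    """X is samples x genes. Returns (scores, loadings) for the leading PCs."""
    Xc = X - X.mean(axis=0)                            # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]    # samples in PC space
    loadings = Vt[:n_components].T                     # genes in PC space
    return scores, loadings

# Hypothetical data: rows 0-2 are tissue A, rows 3-5 are tissue B.
# Genes 0-1 are high in A, genes 2-3 are high in B, gene 4 is uninformative.
X = np.array([
    [5.1, 4.9, 0.2, 0.1, 1.0],
    [5.0, 5.2, 0.1, 0.3, 1.1],
    [4.8, 5.0, 0.3, 0.2, 0.9],
    [0.2, 0.1, 5.0, 4.9, 1.0],
    [0.1, 0.3, 5.2, 5.1, 1.1],
    [0.3, 0.2, 4.9, 5.0, 0.9],
])
scores, loadings = pca_scores_loadings(X, n_components=2)

# Genes with the largest absolute loading on PC1 drive the tissue separation.
top_genes = set(np.argsort(-np.abs(loadings[:, 0]))[:4].tolist())
```

Plotting the rows of `loadings` shows which genes group together, and plotting the rows of `scores` shows which samples group together; reading the two plots side by side is the gene-to-sample link the abstract describes.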

10.
11.
Objective

We provide a survey of recent advances in biomedical image analysis and classification from emergent imaging modalities such as terahertz (THz) pulse imaging (TPI) and dynamic contrast-enhanced magnetic resonance images (DCE-MRIs) and identification of their underlying commonalities.

Methods

Both time and frequency domain signal pre-processing techniques are considered: noise removal, spectral analysis, principal component analysis (PCA) and wavelet transforms. Feature extraction and classification methods based on feature vectors using the above processing techniques are reviewed. A tensorial signal processing de-noising framework suitable for spatiotemporal association between features in MRI is also discussed.

Validation

Examples where the proposed methodologies have been successful in classifying TPIs and DCE-MRIs are discussed.

Results

Identifying commonalities in the structure of such heterogeneous datasets potentially leads to a unified multi-channel signal processing framework for biomedical image analysis.

Conclusion

The proposed complex valued classification methodology enables fusion of entire datasets from a sequence of spatial images taken at different time stamps; this is of interest from the viewpoint of inferring disease proliferation. The approach is also of interest for other emergent multi-channel biomedical imaging modalities and of relevance across the biomedical signal processing community.

12.
Recent studies suggest accurate prediction of tissue of origin for human cancers can be achieved by applying sophisticated statistical learning procedures to gene expression data obtained from DNA microarrays. We have pursued the hypothesis that a more straightforward and equally accurate strategy for classifying human tumors is to use a simple algorithm that considers gene expression levels within a tree-based framework that encodes limited information about pathology and tissue ontogeny. By considering gene expression data within this framework, we found only a small number of genes were required to achieve a relatively high accuracy level in tumor classification. Using as few as 45 genes we were able to classify 157 of 190 human malignant tumors correctly, which is comparable to previous results obtained with sophisticated classifiers using thousands of genes. Our simple classifier accurately predicted the origin of metastatic tumors even when the classifier was trained using only primary tumors, and the classifier produced accurate predictions when trained and tested on expression data from different labs, and from different microarray platforms. Our findings suggest that accurate and robust cancer diagnosis from gene expression profiles can be achieved by mimicking the classification strategies routinely used by surgical pathologists.

13.
Non-invasively reconstructing the transmembrane potentials (TMPs) from body surface potentials (BSPs) constitutes one form of the inverse ECG problem that can be treated as a regression problem with multi-inputs and multi-outputs, and which can be solved using the support vector regression (SVR) method. In developing an effective SVR model, feature extraction is an important task for pre-processing the original input data. This paper proposes the application of principal component analysis (PCA) and kernel principal component analysis (KPCA) to the SVR method for feature extraction. Also, the genetic algorithm and simplex optimization method is invoked to determine the hyper-parameters of the SVR. Based on the realistic heart-torso model, the equivalent double-layer source method is applied to generate the data set for training and testing the SVR model. The experimental results show that the SVR method with feature extraction (PCA-SVR and KPCA-SVR) can perform better than that without feature extraction (single SVR) in terms of the reconstruction of the TMPs on epi- and endocardial surfaces. Moreover, compared with the PCA-SVR, the KPCA-SVR features good approximation and generalization ability when reconstructing the TMPs.
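
The KPCA feature-extraction step can be sketched in a few lines of NumPy. The abstract does not state which kernel or parameters were used, so the RBF kernel, the value of gamma, and the toy input points below are all assumptions; the regression stage is omitted.

```python
import numpy as np

def rbf_kernel_pca(X, gamma=1.0, n_components=2):
    """Project the training points onto the leading kernel principal components.

    Assumption: RBF kernel exp(-gamma * ||x - y||^2); gamma is a free parameter.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared distances
    K = np.exp(-gamma * d2)
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one        # center in feature space
    vals, vecs = np.linalg.eigh(Kc)                   # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]       # keep the largest ones
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.maximum(vals, 0.0))

# Hypothetical 2-D inputs: eight points on each of two concentric circles.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
inner = np.c_[np.cos(angles), np.sin(angles)]
outer = 3.0 * inner
X = np.vstack([inner, outer])
Z = rbf_kernel_pca(X, gamma=0.5, n_components=2)
```

The columns of `Z` would then replace the raw inputs as features for the regressor. Unlike linear PCA, the projection depends on the kernel parameter, so gamma would normally be tuned together with the regression hyper-parameters.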

14.
In a number of practical cases it is important to determine the likely geographical origin of an individual or a biological sample. A dead body, old bones or a sample of semen may be available. Information on where the sample might come from can assist investigation or research. The first part of this paper is independent of specific data structure. We formulate the problem as a classification problem. Bayes' theorem allows different sources of information or data to be reconciled conveniently. The main part of the paper involves high dimensional data for which simple, standard methods are not likely to work properly. Mitochondrial DNA (mtDNA) data is a typical example of such data. We propose a procedure involving essentially two steps. First, principal component analysis is used to reduce the dimension of the data. Next, quadratic discriminant analysis performs the actual classification. A cross validation procedure is implemented to select the optimal number of principal components. The importance of using separate data sets for model fitting and testing is emphasized. This method distinguishes well between individuals with a self-reported European (Icelandic or German) origin and SE Africans. In this case the error rate is 2.0%.

15.
We propose methods to perform a certain nonlinear transformation of features based on a kernel matrix, before the classification step, aiming to improve the discriminating power of the comparatively weak edge-sharpness and texture features of breast masses in mammograms, and seek better incorporation of features representing different radiological characteristics than shape features only. Kernel principal component analysis (KPCA) is applied to improve the discriminating power of each single feature in an expanded feature space and the discriminating capability of different feature combinations in other transformed, more informative, lower-dimensional feature spaces. A kernel partial least squares (KPLS) method is developed to derive score vectors for a shape feature set, and an edge-sharpness and texture feature set, respectively, with minimal covariance between each other, to help in achieving improved diagnosis using multiple radiological characteristics of breast masses. Fisher's linear discriminant analysis (FLDA) is employed to evaluate the classification capability of the transformed features. The methods were tested with a set of 57 regions in mammograms, of which 20 are related to malignant tumors and 37 to benign masses, represented using five shape features, three edge-sharpness features, and 14 texture features. The classification performance of the edge-sharpness and texture features, via KPCA transformation, was significantly improved from 0.75 to 0.85 in terms of the area under the receiver operating characteristics curve (Az). The classification performance of all of the shape, edge-sharpness, and texture features, via KPLS transformation, was improved from 0.95 to 1.0 in Az value.

16.
Selecting relevant and discriminative genes for sample classification is a common and critical task in gene expression analysis (e.g. disease diagnosis). It is desirable that gene selection effectively improve the classification performance of the learning algorithm. In general, for most gene selection methods widely used in practice, an individual gene subset is chosen according to its discriminative power. One deficiency of an individual gene subset is that its contribution to classification is limited. This issue can be alleviated to some extent by ensemble gene selection based on random selection. However, the random approach requires an unnecessarily large number of candidate gene subsets, and its reliability is a problem. In this study, we propose a new ensemble method, called ensemble gene selection by grouping (EGSG), to select multiple gene subsets for classification. Rather than selecting randomly, our method chooses salient gene subsets from microarray data by virtue of information theory and the approximate Markov blanket. The effectiveness and accuracy of our method are validated by experiments on five publicly available microarray data sets. The experimental results show that our ensemble gene selection method has comparable classification performance to other gene selection methods, and is more stable than the random one.

17.
Methapyrilene (MP) exposure of animals can result in an array of adverse pathological responses including hepatotoxicity. This study investigates gene expression and histopathological alterations in response to MP treatment in order to 1) utilize computational approaches to classify samples derived from livers of MP treated rats based on severity of toxicity incurred in the corresponding tissue, 2) phenotypically anchor gene expression patterns, and 3) gain insight into mechanism(s) of methapyrilene hepatotoxicity. Large-scale differential gene expression levels associated with the exposure of male Sprague-Dawley rats to the rodent hepatic carcinogen MP for 1, 3, or 7 days after daily dosage with 10 or 100 mg/kg/day were monitored. Hierarchical clustering and principal component analysis were successful in classifying samples in agreement with microscopic observations and revealed low-dose effects that were not observed histopathologically. Data from cDNA microarray analysis corroborated observed histopathological alterations such as hepatocellular necrosis, bile duct hyperplasia, microvesicular vacuolization, and portal inflammation observed in the livers of MP exposed rats and provided insight into the role of specific genes in the studied toxicological processes.

18.
19.
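Objective

To study the classification of leukemia tumor samples based on microarray gene expression data from molecular biology and an intelligent optimization algorithm.

Methods

A particle swarm optimization (PSO) algorithm was used to train and test the classification model. From 72 leukemia gene expression samples containing 7129 genes, three data sets with 50, 100, and 200 feature genes were selected, and 10 classification tests were run for each gene count. A classification model based on the K-means algorithm was built to verify the classification performance of the PSO algorithm under the same conditions. Machine learning metrics including accuracy, precision, recall, and F1 score, together with boxplots and heatmaps, were used for analysis and comparison.

Results

The data used for PSO classification testing contained 20 acute lymphoblastic leukemia (ALL) samples and 14 acute myeloid leukemia (AML) samples. The mean classification accuracy over the 10 runs was about 90% in every case. The classification accuracy of PSO was not stable: across the 10 tests there was a clear gap between the mean and the best accuracy. Recall for the ALL subtype was clearly higher than for the AML subtype and close to 100% in every case, whereas precision for the AML subtype was clearly higher than for the ALL subtype and close to 100% in every case; the F1 scores showed little comparability. Like PSO, the K-means algorithm performed worse as the number of genes increased. With 200 genes the K-means results were poor: both stability and accuracy dropped sharply and were below the PSO results under the same conditions. With 100 genes, recall for ALL was 100%, higher than for AML, and precision for AML was 100%, higher than for ALL. With 200 genes, the mean recall and F1 score for ALL were higher than for AML and the mean precision for AML was higher than for ALL, while the metrics of the best run differed little. For the same leukemia samples under different numbers of feature genes, PSO obtained relatively high classification accuracy but insufficient stability, and overall outperformed the K-means algorithm.

Conclusion

The PSO algorithm can be applied to the classification of leukemia gene expression samples.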
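
The pattern reported here, with ALL recall and AML precision both near 100%, is what the metric definitions produce when every error is an AML sample predicted as ALL. The sketch below works this through. The class sizes (20 ALL, 14 AML) come from the abstract; the predictions, including the three misclassified AML samples, are hypothetical.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# 20 ALL and 14 AML test samples; the predictions are hypothetical.
y_true = ["ALL"] * 20 + ["AML"] * 14
y_pred = ["ALL"] * 20 + ["AML"] * 11 + ["ALL"] * 3   # 3 AML called ALL

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
all_precision, all_recall, all_f1 = per_class_metrics(y_true, y_pred, "ALL")
aml_precision, aml_recall, aml_f1 = per_class_metrics(y_true, y_pred, "AML")
```

With these predictions the accuracy is 31/34, ALL recall and AML precision are both 1.0, ALL precision is 20/23, and AML recall is 11/14.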

20.
With the development of bioinformatics, tumor classification from gene expression data has become an important and useful technology for cancer diagnosis. Since gene expression data often contain thousands of genes and a small number of samples, gene selection becomes a key step for tumor classification. Attribute reduction of rough sets has been successfully applied to gene selection, as it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy data are therefore usually discretized in preprocessing, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can deal with real-valued data whilst maintaining the original gene classification information. Moreover, this paper addresses an entropy measure under the framework of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. The use of this measure can lead to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression data sets show that the proposed gene selection is an effective method for improving the accuracy of tumor classification.
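
How neighborhood granules let rough sets work on real-valued data can be sketched as follows. The sketch scores gene subsets with the neighborhood dependency (the fraction of samples whose neighborhood contains a single class) rather than the entropy measure proposed in the paper, whose definition the abstract does not give. The radius delta, the greedy search, and the data are assumptions for illustration.

```python
def neighborhood_dependency(X, y, features, delta=0.2):
    """Fraction of samples whose delta-neighborhood on `features` is class-pure."""
    def dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in features) ** 0.5

    pure = 0
    for i, xi in enumerate(X):
        # The neighborhood granule of sample i (it always contains i itself).
        neighbors = [j for j, xj in enumerate(X) if dist(xi, xj) <= delta]
        if all(y[j] == y[i] for j in neighbors):
            pure += 1
    return pure / len(X)

def greedy_select(X, y, delta=0.2):
    """Forward selection: add the gene with the best score, stop when no gain."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        score, f = max(
            (neighborhood_dependency(X, y, selected + [g], delta), g)
            for g in remaining
        )
        if score <= best:
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected, best

# Hypothetical expression values scaled to [0, 1]; rows are samples.
X = [
    [0.10, 0.50, 0.30],
    [0.15, 0.90, 0.35],
    [0.20, 0.10, 0.60],
    [0.80, 0.55, 0.40],
    [0.85, 0.15, 0.70],
    [0.90, 0.85, 0.75],
]
y = [0, 0, 0, 1, 1, 1]
subset, dependency = greedy_select(X, y)
```

No discretization is needed because class purity is checked inside a distance ball around each sample. A larger delta makes the purity test stricter, so it would usually be tuned to the noise level of the data.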

