Similar literature
20 similar documents found (search time: 0 ms)
1.
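Cai X, Wei J, Wen G, Li J. Journal of Biomedical Engineering (《生物医学工程学杂志》), 2011, 28(6): 1213-1216
Gene expression profile data have few samples, high dimensionality, and heavy noise, so dimensionality reduction is essential. Because gene expression profile data exist as high-dimensional nonlinear vectors, traditional dimensionality-reduction methods fail to project some high-dimensional data of low intrinsic dimension into a low-dimensional space. This paper therefore introduces a locally linear embedding (LLE) algorithm with an improved distance to reduce the dimensionality of such data. Because the original LLE method is very sensitive to the number-of-neighbors parameter, the paper proposes an improved distance for measuring the distance between sample points, which makes the algorithm more robust to that parameter and reduces the effect of unevenly distributed sample points. Experimental results show that LLE with the improved distance effectively extracts classification feature information and greatly reduces the dimensionality of gene data while maintaining a high classification accuracy.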

2.
Gene selection is an important task in bioinformatics studies, because the accuracy of cancer classification generally depends upon the genes that have biological relevance to the classification problem. In this work, a randomization test (RT) is used as a gene selection method for dealing with gene expression data. In the method, a statistic derived from the statistics of the regression coefficients in a series of partial least squares discriminant analysis (PLSDA) models is used to evaluate the significance of the genes. Informative genes are selected for classifying the four gene expression datasets of prostate cancer, lung cancer, leukemia and non-small cell lung cancer (NSCLC), and the rationality of the results is validated by multiple linear regression (MLR) modeling and principal component analysis (PCA). With the selected genes, satisfactory results can be obtained.

3.
This study proposes an improved decision forest (IDF) with an integrated graphical user interface. Based on four gene expression data sets, the IDF not only outperforms the original decision forest, but is also superior or comparable to other state-of-the-art machine learning methods, especially in dealing with high-dimensional data. With an integrated built-in feature selection (FS) mechanism and fewer parameters to tune, it can be trained more efficiently than methods such as the support vector machine, and can be built with far fewer trees than other popular tree-based ensemble methods. Moreover, it suffers less from the curse of dimensionality.

4.
Due to recent advances in DNA microarray technology, the diagnostic category of tissue samples can be predicted with high accuracy from gene expression profiles. In this study, we discuss shortcomings of some existing gene expression profile classification methods and propose a new approach based on linear Bayesian classifiers. In our approach, we first construct gene-level linear classifiers to identify genes that provide high class-prediction accuracies, i.e., low error rates. After this screening phase, starting with the gene that offers the lowest error rate, we construct a multi-dimensional linear classifier by incorporating the next best-performing genes until the prediction error reaches its minimum, or 0 if possible. When we compared the classification performance of our approach against approaches based on prediction analysis of microarrays (PAM) and support vector machines (SVM), we found that our method outperforms PAM and produces results comparable with SVM. In addition, we observed that the gene selection scheme of PAM could be misleading. Although SVM achieves relatively higher prediction performance, it has two major disadvantages: complexity and lack of insight about important genes. Our intuitive approach offers competitive performance and also an efficient means for finding important genes.
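The two-phase procedure in this abstract (screen genes one at a time, then grow a multi-gene classifier from the best gene) can be illustrated with a rough sketch. The abstract does not give the authors' classifier or stopping rule in detail, so everything here is an assumption: a nearest-centroid rule stands in for the linear Bayesian classifier, error is measured on the training samples, a gene is kept only if it lowers that error, and the names `centroid_error` and `forward_select` are hypothetical.

```python
from statistics import mean


def centroid_error(X, y, genes):
    """Training error rate of a nearest-centroid rule restricted to `genes`.

    Assumption: nearest-centroid is a stand-in for the linear Bayesian
    classifier mentioned in the abstract, not the authors' actual model.
    """
    classes = sorted(set(y))
    # One centroid per class, computed only over the chosen gene columns.
    centroids = {
        c: [mean(row[g] for row, label in zip(X, y) if label == c) for g in genes]
        for c in classes
    }
    errors = 0
    for row, label in zip(X, y):
        predicted = min(
            classes,
            key=lambda c: sum((row[g] - m) ** 2 for g, m in zip(genes, centroids[c])),
        )
        errors += predicted != label
    return errors / len(y)


def forward_select(X, y):
    """Screen genes individually, then add genes greedily in ranked order."""
    n_genes = len(X[0])
    # Screening phase: rank genes by their single-gene error rate.
    ranked = sorted(range(n_genes), key=lambda g: centroid_error(X, y, [g]))
    selected = [ranked[0]]
    best = centroid_error(X, y, selected)
    for g in ranked[1:]:
        if best == 0:
            break  # error cannot improve further
        err = centroid_error(X, y, selected + [g])
        if err < best:  # keep a gene only if it lowers the error (assumption)
            selected.append(g)
            best = err
    return selected, best


# Toy data: rows are samples, columns are genes; gene 0 separates the classes.
X = [[0.0, 5.0, 1.0], [0.2, 1.0, 1.0], [1.0, 4.0, 1.0], [1.2, 0.0, 1.0]]
y = [0, 0, 1, 1]
selected, error = forward_select(X, y)
```

In practice the error would be estimated by cross-validation rather than on the training samples, since training error is optimistic when genes far outnumber samples.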

5.
The field of gene expression data analysis has grown in the past few years from being purely data-centric to integrative, aiming at complementing microarray analysis with data and knowledge from diverse available sources. In this review, we report on the plethora of gene expression data mining techniques and focus on their evolution toward knowledge-based data analysis approaches. In particular, we discuss recent developments in gene expression-based analysis methods used in association and classification studies, phenotyping and reverse engineering of gene networks.

6.
Classification into multiple classes when the measured variables outnumber the samples is a major methodological challenge in -omics studies. Two algorithms that overcome the dimensionality problem are presented: the forest classification tree (FCT) and the forest support vector machines (FSVM). In FCT, a set of variables is randomly chosen and a classification tree (CT) is grown using a forward classification algorithm. The process is repeated and a forest of CTs is derived. Finally, the most frequent variables from the trees with the smallest apparent misclassification rate (AMR) are used to construct a productive tree. In FSVM, the CTs are replaced by SVMs. The methods are demonstrated using prostate gene expression data for classifying tissue samples into four tumor types. For a threshold split value of 0.001 and utilizing 100 markers, the productive CT consisted of 29 terminal nodes and achieved perfect classification (AMR=0). When the threshold value was set to 0.01, a tree with 17 terminal nodes was constructed based on 15 markers (AMR=7%). In FSVM, reducing the fraction of the forest that was used to construct the best classifier from the top 80% to the top 20% reduced the misclassification to 25% (when using 200 markers). The proposed methodologies may be used for identifying important variables in high-dimensional data. Furthermore, the FCT allows exploring the data structure and provides a decision rule.
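The final FCT step, picking the most frequent variables from the trees with the smallest AMR, can be sketched as a simple counting exercise. This is only an illustration under stated assumptions: each tree is represented as a `(variables, amr)` pair, and the function name `frequent_variables` and the parameters `top_fraction` and `n_keep` are hypothetical, not taken from the paper.

```python
from collections import Counter


def frequent_variables(forest, top_fraction=0.2, n_keep=3):
    """Return the variables used most often by the lowest-AMR trees.

    forest: list of (variables, amr) pairs, one per tree (assumed layout).
    top_fraction: share of the forest, ranked by AMR, that is counted.
    n_keep: how many of the most frequent variables to return.
    """
    ranked = sorted(forest, key=lambda tree: tree[1])  # smallest AMR first
    n_top = max(1, int(len(ranked) * top_fraction))
    counts = Counter(
        v
        for variables, _ in ranked[:n_top]
        for v in sorted(set(variables))  # count each variable once per tree
    )
    return [v for v, _ in counts.most_common(n_keep)]


# Toy forest of four trees; the two best trees share the variable "g2".
forest = [
    (["g1", "g2"], 0.0),
    (["g4"], 0.5),
    (["g2", "g3"], 0.05),
    (["g5"], 0.6),
]
top = frequent_variables(forest, top_fraction=0.5, n_keep=3)
```

The returned variables would then be used to grow the single "productive" tree; that step depends on the tree-growing algorithm and is not shown.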

7.
With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Since a gene expression dataset often contains thousands of genes and a small number of samples, gene selection becomes a key step for tumor classification. Attribute reduction based on rough sets has been successfully applied to gene selection, as it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy data are usually discretized in a preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can deal with real-valued data whilst maintaining the original gene classification information. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. Using this measure can lead to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection method is effective for improving the accuracy of tumor classification.

8.
9.
Gene expression datasets are a means to classify and predict the diagnostic category of a patient. Selecting informative genes and representative samples are two important aspects of reducing gene expression data. Identifying and pruning redundant genes and samples simultaneously can improve the performance of classification and circumvent the local optima problem. In the present paper, a modified particle swarm optimization was applied to select optimal genes and samples simultaneously, and a support vector machine was used as the objective function to determine the optimum set of genes and samples. To evaluate the performance of the proposed method, it was applied to three publicly available microarray datasets. The results demonstrate that the proposed method for gene and sample selection is a useful tool for mining high-dimensional data.

10.
Microarray data analysis and classification have been convincingly shown to provide an effective methodology for diagnosing diseases and cancers. Although much research has been performed on applying machine learning techniques to microarray data classification in past years, conventional machine learning techniques have been shown to have intrinsic drawbacks in achieving accurate and robust classifications. This paper presents a novel ensemble machine learning approach for developing robust microarray data classification. Unlike conventional ensemble learning techniques, the approach begins by generating a pool of candidate base classifiers based on gene sub-sampling and then selects a subset of appropriate base classifiers to construct the classification committee based on classifier clustering. Experimental results demonstrate that the classifiers constructed by the proposed method outperform not only the classifiers generated by conventional machine learning but also the classifiers generated by two widely used conventional ensemble learning methods (bagging and boosting).

11.
12.
The invention of microarrays has rapidly changed the state of biological and biomedical research. Clustering algorithms play an important role in analyzing microarray data sets, where identifying groups of co-expressed genes is a very difficult task. Here we pose the problem of clustering microarray data as a multiobjective clustering problem. A new symmetry-based fuzzy clustering technique is developed to solve this problem. The effectiveness of the proposed technique is demonstrated on five publicly available benchmark data sets. Results are compared with some widely used microarray clustering techniques. Statistical and biological significance tests have also been carried out.

13.
Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristic and a good measure for guiding the search is essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts which guide the search towards meaningful results. In this paper we present an extensive survey of biclustering approaches, classifying them into two categories according to whether or not they use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non-metric-based biclustering algorithms. In both cases, the approaches are further classified according to the type of meta-heuristic on which they are based.

14.
Gene selection is important for cancer classification based on gene expression data, because of high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used in the process, which is suitable for learning the weights of genes during clustering and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets, using two classifiers (support vector machine and k-nearest neighbor), and compared it with six other competitive feature selectors. The results show that the proposed algorithm is capable of achieving better accuracies and may be an efficient tool for finding possible biomarkers from gene expression data.

15.
¹H MRS is an attractive choice for non-invasively diagnosing brain tumours. Many studies have been performed to create an objective decision support system, but there is not yet a consensus as to the best techniques of MRS acquisition or data processing to be used for optimum classification. In this study, we investigate whether LCModel analysis of short-TE (30 ms), single-voxel tumour spectra provides a better input for classification than the use of the original spectra. A total of 145 histologically diagnosed brain tumour spectra were acquired [14 astrocytoma grade II (AS2), 15 astrocytoma grade III (AS3), 42 glioblastoma (GBM), 41 metastases (MET) and 33 meningioma (MNG)], and linear discriminant analyses (LDA) were performed on the LCModel analysis of the spectra and the original spectra. The results consistently suggest improvement in classification when the LCModel concentrations are used. LDA of AS2, MNG and high-grade tumours (HG, comprising GBM and MET) correctly classified 94% using the LCModel dataset compared with 93% using the spectral dataset. The inclusion of AS3 reduced the accuracy to 82% and 78% for LCModel analysis and the original spectra, respectively, and further separating HG into GBM and MET gave 70% compared with 60%. Generally MNG spectra have profiles that are visually distinct from those of the other tumour types, but the classification accuracy was typically about 80%, with MNG with substantial lipid/macromolecule signals being classified as HG. Omission of the lipid/macromolecule concentrations in the LCModel dataset provided an improvement in classification of MNG (91% compared with 76%). In conclusion, there appears to be an advantage to performing pattern recognition on the quantitative analysis of tumour spectra rather than using the whole spectra. However, the results suggest that a two-step LDA process may help in classifying the five tumour groups to provide optimum classification of MNG with high lipid/macromolecule contributions, which may be misclassified as HG.

16.
Genomewide profiling of gene expression, made possible by the development of DNA microarray technology and made more powerful by the sequencing of the human genome, has led to advances in tumor classification and biomarker discovery for the common types of human neoplasia. Application of this approach to the field of endocrine neoplasia is in its infancy, although some progress has been recently reported. In this review, the progress to date is summarized and the promise of DNA microarray analysis in conjunction with tissue array immunohistochemistry to significantly impact endocrine tumor diagnosis and prognosis is discussed.

17.
18.
19.
Purpose: Adjuvant chemotherapy (ACT) is used after surgery to prevent recurrence or metastases. However, ACT for non-small cell lung cancer (NSCLC) is still controversial. This study aimed to develop prediction models to distinguish patients who are suitable for ACT (ACT-benefit) from those who should avoid ACT (ACT-futile) in NSCLC. Methods: We identified the ACT-correlated gene signatures and applied several types of ANN algorithms to construct the optimal ANN architecture for ACT benefit classification. Reliability was assessed by cross-data set validation. Results: We found that 2 probes (2 genes) combined with T-stage clinical data gave a good prediction result. These genes were 208893_s_at (DUSP6) and 204891_s_at (LCK). The 10-fold cross validation classification accuracy was 65.71%. The best-performing ANN model was MLP14-8-2 with a logistic activation function. Conclusions: Using gene signature profiles to predict ACT benefit in NSCLC is feasible. The key to this analysis was identifying the pertinent genes and classification. This study may help reduce ineffective medical practices and avoid wasting medical resources.

20.
Objective: Mining disease-specific associations from existing knowledge resources can be useful for building disease-specific ontologies and supporting knowledge-based applications. Many association mining techniques have been exploited. However, a challenge remains when the extracted associations contain much noise. It is unreliable to determine the relevance of an association by simply setting arbitrary cut-off points on multiple scores of relevance, and it would be expensive to ask human experts to manually review a large number of associations. We propose that machine-learning-based classification can be used to separate the signal from the noise, and to provide a feasible approach to create and maintain disease-specific vocabularies. Method: We initially focused on disease-medication associations for the purpose of simplicity. For a disease of interest, we extracted potentially treatment-related drug concepts from biomedical literature citations and from a local clinical data repository. Each concept was associated with multiple measures of relevance (i.e., features) such as frequency of occurrence. For the purpose of machine learning, we formed nine datasets for three diseases, with each disease having two single-source datasets and one formed by combining the previous two datasets. All the datasets were labeled using existing reference standards. Thereafter, we conducted two experiments: (1) to test if adding features from the clinical data repository would improve the performance of classification achieved using features from the biomedical literature only, and (2) to determine if classifier(s) trained with known medication-disease data sets would be generalizable to new disease(s). Results: Simple logistic regression and LogitBoost were the two classifiers identified as the preferred models for the biomedical-literature datasets and the combined datasets, respectively. The performance of the classification using combined features provided significant improvement beyond that using biomedical-literature features alone (p-value < 0.001). The performance of the classifier built from known diseases to predict associated concepts for new diseases showed no significant difference from the performance of the classifier built and tested using the new disease's dataset. Conclusion: It is feasible to use classification approaches to automatically predict the relevance of a concept to a disease of interest. It is useful to combine features from disparate sources for the task of classification. Classifiers built from known diseases were generalizable to new diseases.
