Similar literature
1.
Selecting relevant and discriminative genes for sample classification is a common and critical task in gene expression analysis (e.g. disease diagnosis). Ideally, gene selection should effectively improve the classification performance of the learning algorithm. Most gene selection methods in wide use choose a single gene subset according to its discriminative power. One deficiency of a single gene subset is that its contribution to classification is limited. Ensemble gene selection based on random selection can alleviate this issue to some extent. However, the random approach requires an unnecessarily large number of candidate gene subsets, and its reliability is a problem. In this study, we propose a new ensemble method, called ensemble gene selection by grouping (EGSG), to select multiple gene subsets for classification. Rather than selecting randomly, our method chooses salient gene subsets from microarray data by means of information theory and the approximate Markov blanket. The effectiveness and accuracy of our method are validated by experiments on five publicly available microarray data sets. The experimental results show that our ensemble gene selection method has classification performance comparable to other gene selection methods and is more stable than the random one.
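The abstract names information theory and the approximate Markov blanket but gives no formulas. The sketch below assumes the commonly used definition based on symmetrical uncertainty (SU) over discretized expression values: gene i forms an approximate Markov blanket for gene j if SU(i, class) >= SU(j, class) and SU(i, j) >= SU(j, class). The function names are hypothetical, and EGSG itself may differ in detail.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); 0 means independent, 1 means fully dependent."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(x, y)))
    mutual_info = hx + hy - joint
    return 2.0 * mutual_info / (hx + hy)

def is_approx_markov_blanket(gene_i, gene_j, labels):
    """Assumed criterion: gene_i is at least as relevant as gene_j
    and more informative about gene_j than gene_j is about the class."""
    su_i = symmetrical_uncertainty(gene_i, labels)
    su_j = symmetrical_uncertainty(gene_j, labels)
    return su_i >= su_j and symmetrical_uncertainty(gene_i, gene_j) >= su_j
```

Under this criterion, a gene covered by a more relevant gene can be treated as redundant and placed in the same group.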

2.
With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Since a gene expression dataset often contains thousands of genes and only a small number of samples, gene selection is a key step in tumor classification. Attribute reduction based on rough sets has been successfully applied to gene selection because it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy measurements are therefore usually discretized in a preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can deal with real-valued data while maintaining the original gene classification information. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets to tackle the uncertainty and noise of gene expression data. Using this measure leads to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection method is effective in improving the accuracy of tumor classification.
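To illustrate how neighborhood granules handle real-valued data without discretization, here is a minimal sketch of the standard neighborhood rough set dependency: the fraction of samples whose delta-neighborhood contains only samples of the same class. This is not the entropy measure proposed in the abstract, which is not specified there; the function name, the Euclidean distance, and the `delta` parameter are assumptions.

```python
import math

def neighborhood_dependency(samples, labels, genes, delta):
    """Fraction of samples whose delta-neighborhood, measured over the
    chosen gene indices, contains only samples with the same label."""
    def dist(a, b):
        return math.sqrt(sum((a[g] - b[g]) ** 2 for g in genes))

    consistent = 0
    for i, xi in enumerate(samples):
        # The neighborhood granule of sample i (it always includes i itself).
        neighbors = [j for j, xj in enumerate(samples) if dist(xi, xj) <= delta]
        if all(labels[j] == labels[i] for j in neighbors):
            consistent += 1
    return consistent / len(samples)
```

A gene subset with dependency close to 1 separates the classes within the chosen neighborhood radius, so a greedy search could add genes while this value keeps increasing.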

3.
Due to recent advances in DNA microarray technology, the diagnostic category of tissue samples can be predicted with high accuracy from gene expression profiles. In this study, we discuss shortcomings of some existing gene expression profile classification methods and propose a new approach based on linear Bayesian classifiers. In our approach, we first construct gene-level linear classifiers to identify genes that provide high class-prediction accuracies, i.e., low error rates. After this screening phase, starting with the gene that offers the lowest error rate, we construct a multi-dimensional linear classifier by incorporating the next best-performing genes until the prediction error reaches its minimum, or 0 if possible. When we compared the classification performance of our approach against approaches based on prediction analysis of microarrays (PAM) and support vector machines (SVM), we found that our method outperforms PAM and produces results comparable to SVM. In addition, we observed that the gene selection scheme of PAM could be misleading. Although SVM achieves relatively higher prediction performance, it has two major disadvantages: complexity and lack of insight about important genes. Our intuitive approach offers competitive performance and also an efficient means of finding important genes.

4.
In high-dimensional data, only a small number of features are significantly correlated with classification. An ensemble feature selection method based on cluster grouping is proposed in this paper. Classification-related features are chosen using a ranking aggregation technique. These features are divided into unrelated groups by an affinity propagation clustering algorithm with a bicor correlation coefficient. Diverse and discriminative feature subsets are constructed by randomly selecting a feature from each group and are used to train base classifiers. Finally, the base classifiers with better classification performance are selected using a kappa coefficient and integrated using a majority voting strategy. The experimental results on five gene expression datasets show that the proposed method has low classification error rates, stable classification performance and strong scalability in terms of the sensitivity, specificity, accuracy and G-Mean criteria.
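The last two steps of the abstract (selecting base classifiers by a kappa coefficient, then combining them by majority voting) can be sketched as follows. This assumes the kappa coefficient is Cohen's kappa between each classifier's predictions and the true labels, which the abstract does not state; the function names are hypothetical.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between predictions and labels, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0 if observed == 1.0 else 0.0
    return (observed - expected) / (1.0 - expected)

def majority_vote(predictions):
    """predictions: one list of labels per base classifier.
    Returns the most common label per sample (ties go to the label seen first)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

A possible use is to keep only the base classifiers whose kappa on validation data exceeds a chosen threshold and pass their prediction lists to `majority_vote`.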

5.
Tumor classification is an important application domain of gene expression data. Because such data are high-dimensional with a small sample size (SSS) and contain a great number of redundant genes unrelated to tumor phenotypes, various feature extraction and gene selection methods have been applied to gene expression data analysis. Wavelet packet transforms (WPT) and neighborhood rough sets (NRS) are effective tools for extracting and selecting features. In this paper, a novel approach to tumor classification is proposed based on WPT and NRS. First, the classification features are extracted by WPT and the decision tables are formed. Second, the attributes of the decision tables are reduced by NRS. Third, a feature subset with few attributes and high classification ability is obtained. The experimental results on three gene expression datasets demonstrate that the proposed method is effective and feasible.

6.
OBJECTIVE: Recently, gene expression profiling using microarray techniques has been shown to be a promising tool for improving the diagnosis and treatment of cancer. Gene expression data contain a high level of noise and an overwhelming number of genes relative to the number of available samples. This poses a great challenge for machine learning and statistical techniques. The support vector machine (SVM) has been successfully used to classify gene expression data of cancer tissue. In the medical field, it is crucial to give the user a transparent decision process. How to explain the computed solutions and present the extracted knowledge is a main obstacle for SVM. MATERIAL AND METHODS: A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple-parameter learning problem. A shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We propose a novel rule extraction approach that uses the information provided by the separating hyperplane and support vectors to improve the generalization capacity and comprehensibility of the rules and to reduce the computational complexity. RESULTS AND CONCLUSION: Two public gene expression datasets, the leukemia dataset and the colon tumor dataset, are used to demonstrate the performance of this approach. Using a small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% for both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.

7.
8.
Our main interest in supervised classification of gene expression data is to infer whether the expressions can discriminate biological characteristics of samples. With thousands of gene expressions to consider, a gene selection has been advocated to decrease classification by including only the discriminating genes. We propose to base the gene selection on partial least squares (PLS) and logistic regression random-effects (RE) estimates before the selected genes are evaluated in classification models. We compare this selection with selection based on the two-sample t-statistics, a current practice, and on modified t-statistics. The results indicate that gene selection based on logistic regression RE estimates is recommended in a general situation, while selection based on the PLS estimates is recommended when the number of samples is low. Gene selection based on the modified t-statistics performs well when the genes exhibit moderate-to-high variability with moderate group separation. Respecting the characteristics of the data is a key aspect to consider in gene selection.
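For reference, the baseline the abstract calls "current practice" (ranking genes by a two-sample t-statistic) can be sketched as below. The abstract does not say which variant is used; this sketch assumes the unequal-variance (Welch) form, two classes labeled 0 and 1, and non-constant expression within at least one group. The function names are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t-statistic with unequal variances (Welch form).
    Assumes each group has at least two values and the pooled standard error is nonzero."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_genes_by_t(expr, labels):
    """expr: rows are samples, columns are genes; labels: 0 or 1 per sample.
    Returns gene indices ordered from most to least discriminating."""
    scores = []
    for g in range(len(expr[0])):
        group0 = [row[g] for row, y in zip(expr, labels) if y == 0]
        group1 = [row[g] for row, y in zip(expr, labels) if y == 1]
        scores.append((abs(welch_t(group0, group1)), g))
    return [g for _, g in sorted(scores, reverse=True)]
```

The top-ranked indices would then be passed to a classifier, which is the step where the abstract's PLS and RE estimates offer an alternative ranking.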

9.
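Classification of breast cancer gene data is of great significance in clinical medicine. To address the complex structure, high dimensionality and small sample size of gene data, a gene data classification method is proposed that combines maximum relevance minimum conditional redundancy with a deep cascade forest. The breast cancer gene expression dataset from the Broad Institute was used, with 98 samples in total, each containing 1,213 feature genes. The data are first standardized, a feature subset is then selected using maximum relevance minimum conditional redundancy, and finally the feature subset is classified using the deep cascade forest. Random forest, support vector machine and BP neural network were used as comparison methods. The results show that the proposed combination of maximum relevance minimum conditional redundancy and deep cascade forest reaches a best classification accuracy of 93.78%, clearly better than the other methods. The method can effectively improve the classification accuracy of breast cancer gene data and has important theoretical significance and practical value for breast cancer classification based on gene data.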

10.
OBJECT: The classification of cancer based on gene expression data is one of the most important procedures in bioinformatics. In order to obtain highly accurate results, ensemble approaches have been applied when classifying DNA microarray data. Diversity is very important in these ensemble approaches, but it is difficult to apply conventional diversity measures when only a few training samples are available. A key issue that needs to be addressed under such circumstances is the development of a new ensemble approach that can enhance the successful classification of these datasets. MATERIALS AND METHODS: An effective ensemble approach that does use diversity in genetic programming is proposed. This diversity is measured by comparing the structure of the classification rules instead of by output-based diversity estimation. RESULTS: Experiments performed on common gene expression datasets (such as the lymphoma cancer dataset, lung cancer dataset and ovarian cancer dataset) demonstrate the performance of the proposed method in relation to conventional approaches. CONCLUSION: Diversity measured by comparing the structure of the classification rules obtained by genetic programming is useful for improving the performance of the ensemble classifier.

11.
Selecting a subset of genes with strong discriminative power is a very important step in classification problems based on gene expression data. Lasso and Dantzig selector are known to have automatic variable selection ability in linear regression analysis. This paper applies Lasso and Dantzig selector to select the most informative genes for representing the probability of an example being positive as a linear function of the gene expression data. The selected genes are further used to fit different classifiers for cancer classification. Comparative experiments were conducted on six publicly available cancer datasets, and the detailed comparison results show that in general, Lasso is more capable than Dantzig selector at selecting informative genes for cancer classification.

12.
For each cancer type, only a few genes are informative. Due to the so-called ‘curse of dimensionality’ problem, the gene selection task remains a challenge. To overcome this problem, we propose a two-stage gene selection method called MRMR-COA-HS. In the first stage, minimum redundancy and maximum relevance (MRMR) feature selection is used to select a subset of relevant genes. The selected genes are then fed into a wrapper setup that combines a new algorithm, COA-HS, with the support vector machine as a classifier. The method was applied to four microarray datasets, and the performance was assessed by leave-one-out cross-validation. Comparative performance assessment of the proposed method against other evolutionary algorithms suggested that the proposed algorithm significantly outperforms other methods in selecting fewer genes while maintaining the highest classification accuracy. The functions of the selected genes were further investigated, and it was confirmed that the selected genes are biologically relevant to each cancer type.
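The leave-one-out cross-validation used for assessment above can be sketched as follows: each sample is held out once, the classifier is trained on the rest, and accuracy is the fraction of held-out samples predicted correctly. The abstract uses a support vector machine; to keep the example self-contained, a 1-nearest-neighbor classifier is swapped in here as a stand-in, and all function names are hypothetical.

```python
def loocv_accuracy(samples, labels, classify):
    """Leave-one-out cross-validation accuracy.
    classify(train_x, train_y, query) must return a predicted label."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def one_nn(train_x, train_y, query):
    """1-nearest-neighbor stand-in classifier (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[dists.index(min(dists))]
```

In a wrapper setup, `loocv_accuracy` evaluated on the columns of a candidate gene subset would serve as the fitness that the search algorithm tries to maximize.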

13.
14.
This paper presents an ensemble feature selection and classification technique for classifying two types of breast lesion, benign and malignant. Features are selected based on their area under the ROC curve (AUC) and are then classified using a hybrid hidden Markov model (HMM)-fuzzy approach. HMM-generated log-likelihood values are used to generate minimized fuzzy rules, which are further optimized using gradient descent algorithms in order to enhance classification performance. The developed model is applied to the Wisconsin breast cancer dataset to test its performance. The results indicate that a combination of selected features and the HMM-fuzzy approach can effectively classify the lesion types using only two fuzzy rules. Our experimental results also indicate that the proposed model can produce better classification accuracy than most other computational tools.
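The AUC-based feature selection step can be illustrated with a minimal sketch. It computes each feature's AUC as the probability that a randomly chosen positive sample has a higher feature value than a randomly chosen negative one (ties count one half). Treating an AUC below 0.5 as equally informative by taking max(AUC, 1 - AUC) is an assumption of this sketch, as are the function names and the 0/1 label encoding.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives (1) and negatives (0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features_by_auc(rows, labels):
    """rows: one list of feature values per sample.
    Returns feature indices ordered from most to least discriminating."""
    scored = []
    for f in range(len(rows[0])):
        a = auc([row[f] for row in rows], labels)
        scored.append((max(a, 1.0 - a), f))  # direction-agnostic score (assumption)
    return [f for _, f in sorted(scored, reverse=True)]
```

The pairwise form is quadratic in the number of samples, which is acceptable for small datasets; a rank-based computation would be preferable for large ones.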

15.
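OBJECTIVE: To classify leukemia tumor samples using microarray gene expression data from molecular biology and an intelligent optimization algorithm. METHODS: A particle swarm optimization (PSO) algorithm was used to train and test the classification model. From 72 leukemia gene expression samples containing 7129 genes, three groups of data with 50, 100 and 200 feature genes were selected, and 10 classification tests were run for each gene count. A classification model based on the K-means algorithm was built to validate the classification performance of the PSO algorithm under the same conditions. Machine learning metrics including accuracy, precision, recall and F1 score, together with boxplots and heatmaps, were used for analysis and comparison. RESULTS: The data used for the PSO classification tests contained 20 acute lymphoblastic leukemia (ALL) samples and 14 acute myeloid leukemia (AML) samples. The mean classification accuracy over the 10 runs was around 90% in every setting. The classification accuracy of the PSO algorithm was not stable: across the 10 classification tests, there was a clear difference between the mean and the best accuracy. The recall of the ALL subtype was clearly higher than that of the AML subtype and was close to 100% in every setting, whereas the precision of the AML subtype was clearly higher than that of the ALL subtype and was close to 100% in every setting; the F1 scores offered little basis for comparison. The K-means algorithm behaved similarly to the PSO algorithm, with classification performance decreasing as the number of genes increased. With 200 genes, the K-means algorithm gave poor classification results, with large drops in both classification stability and accuracy, and performed worse than the PSO algorithm under the same conditions. With 100 genes, the recall of the ALL subtype was 100%, higher than that of the AML subtype, and the precision of the AML subtype was 100%, higher than that of the ALL subtype. With 200 genes, the mean recall and F1 score of the ALL subtype were higher than those of the AML subtype and the precision of the AML subtype was higher than that of the ALL subtype, while the metrics of the best runs differed little. For the same leukemia tumor samples under different numbers of feature genes, the PSO algorithm achieved classification results with relatively high accuracy but insufficient stability, and overall it outperformed the K-means algorithm. CONCLUSION: The PSO algorithm can be applied to the classification of leukemia gene expression samples.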
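The per-subtype metrics reported above (accuracy, precision, recall, F1) can be computed as in the following sketch, where one class at a time is treated as the positive class. The function names are hypothetical and no data from the study are reproduced.

```python
def binary_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 with the given class treated as positive.
    Each metric falls back to 0.0 when its denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

When a classifier misassigns some AML samples to ALL but never the reverse, ALL recall and AML precision are both 100% while ALL precision and AML recall drop, which matches the pattern described in the results.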

16.
17.
Gene selection is important for cancer classification based on gene expression data because of the high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used in the process, which is suitable for learning the weights of genes during clustering and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets using two classifiers (support vector machine and k-nearest neighbor) and compared it with six other competitive feature selectors. The results show that the proposed algorithm is capable of achieving better accuracies and may be an efficient tool for finding possible biomarkers from gene expression data.
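The abstract does not give the form of its kernel-based dissimilarity. A common construction, assumed here, is the squared distance induced by a kernel K in its feature space: d²(x, y) = K(x, x) - 2 K(x, y) + K(y, y). The sketch below uses a Gaussian kernel with a hypothetical `sigma` parameter and omits the per-gene adaptive weights described in the abstract.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_distance_sq(x, y, kernel):
    """Squared distance between x and y in the feature space induced by the kernel."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
```

For the Gaussian kernel, K(x, x) = 1, so the dissimilarity simplifies to 2 - 2 K(x, y) and is bounded between 0 and 2.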

18.
19.
Background: Gene name recognition and normalization is, together with the detection of other named entities, a crucial step in biomedical text mining and the underlying basis for the development of more advanced techniques such as the extraction of complex events. While current state-of-the-art solutions achieve highly promising results on average, performance can drop significantly for specific genes with highly ambiguous synonyms. Depending on the topic of interest, this can create the need for extensive manual curation of such text mining results. Our goal was to enhance this curation step using tools widely used in the pharmaceutical industry, utilizing the text processing and classification capabilities of the Konstanz Information Miner (KNIME) along with publicly available sources. Results: The F-score achieved on gene-specific test corpora for highly ambiguous genes could be improved from values close to zero, due to very low precision, to values >0.9 in several cases. Interestingly, the presented approach even resulted in an increased F-score for genes that already showed good results in the initial gene name normalization. For most test cases, we could significantly improve precision while retaining a high recall. Conclusions: We could show that KNIME can be used to assist in the manual curation of text mining results containing high numbers of false positive hits. Our results also indicate that it could be beneficial for future development in the field of gene name normalization to create gene-specific training corpora based on incorrectly identified genes common to current state-of-the-art algorithms.

20.
In microarray-based cancer classification and prediction, gene selection is an important research problem owing to the large number of genes and the small number of experimental conditions. In this paper, we propose a Bayesian approach to gene selection and classification using the logistic regression model. The basic idea of our approach is to use a logistic regression model to relate the gene expression to the class labels. We use Gibbs sampling and Markov chain Monte Carlo (MCMC) methods to discover important genes. To implement the Gibbs sampler and MCMC search, we derive a posterior distribution of selected genes given the observed data. After the important genes are identified, the same logistic regression model is used for cancer classification and prediction. Issues in the efficient implementation of the proposed method are discussed. The proposed method is evaluated on several large microarray data sets, including hereditary breast cancer, small round blue-cell tumors, and acute leukemia. The results show that the method can effectively identify important genes consistent with known biological findings while also achieving high classification accuracy. Finally, the robustness and sensitivity properties of the proposed method are also investigated.
