Similar literature
 Found 20 similar articles; search took 15 ms
1.
Our main interest in supervised classification of gene expression data is to infer whether the expressions can discriminate biological characteristics of samples. With thousands of gene expressions to consider, gene selection has been advocated to decrease classification error by including only the discriminating genes. We propose basing gene selection on partial least squares (PLS) and logistic regression random-effects (RE) estimates before the selected genes are evaluated in classification models. We compare this selection with that based on two-sample t-statistics, the current practice, and on modified t-statistics. The results indicate that gene selection based on logistic regression RE estimates is recommended in general, while selection based on PLS estimates is recommended when the number of samples is low. Gene selection based on the modified t-statistics performs well when the genes exhibit moderate-to-high variability with moderate group separation. Respecting the characteristics of the data is a key aspect to consider in gene selection.
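
The two-sample t-statistic baseline mentioned above can be sketched as a simple ranking step. This is a minimal illustration, not the authors' procedure: the abstract does not say which t-statistic variant is used, so the Welch (unequal-variance) form is an assumption, and the function names `welch_t` and `rank_genes` and the toy expression values are made up for the example.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t-statistic for two lists of expression values (assumed variant)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_genes(expr, labels, top_k):
    """Rank genes by |t| between the two classes and keep the top_k names.

    expr maps a gene name to a list of values, one per sample; labels are 0 or 1.
    """
    scores = {}
    for gene, values in expr.items():
        group0 = [v for v, y in zip(values, labels) if y == 0]
        group1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(group0, group1))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data, invented for illustration: geneA separates the groups, geneB does not.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.0, 3.1, 2.8],
    "geneB": [2.0, 2.5, 1.5, 2.1, 2.4, 1.6],
}
selected = rank_genes(expr, labels, top_k=1)
```

Only the selected genes would then be passed to a classifier, which is the filtering role the abstract assigns to gene selection.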

2.
Gene expression datasets are a means to classify and predict the diagnostic categories of a patient. Selecting informative genes and representative samples are two important aspects of reducing gene expression data. Identifying and pruning redundant genes and samples simultaneously can improve classification performance and circumvent the local optima problem. In the present paper, a modified particle swarm optimization was applied to select optimal genes and samples simultaneously, and a support vector machine was used as the objective function to determine the optimum set of genes and samples. To evaluate the performance of the proposed method, it was applied to three publicly available microarray datasets. The results demonstrate that the proposed method for gene and sample selection is a useful tool for mining high-dimensional data.

3.
Gene selection is important for cancer classification based on gene expression data because of the high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used in the process, which is suitable for learning the weights of genes during clustering and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets using two classifiers (support vector machine and k-nearest neighbor) and compared it with six other competitive feature selectors. The results show that the proposed algorithm is capable of achieving better accuracies and may be an efficient tool for finding possible biomarkers from gene expression data.

4.
5.
Gene expression data collected from DNA microarrays are characterized by a large number of variables (genes) but only a small number of observations (experiments). In this paper, a manifold learning method is proposed to map the gene expression data to a low-dimensional space and then explore the intrinsic structure of the features so as to classify the microarray data more accurately. The proposed algorithm can project the gene expression data into a subspace with high intra-class compactness and inter-class separability. Experimental results on six DNA microarray datasets demonstrated that the method is efficient for discriminant feature extraction and gene expression data classification. This work is a meaningful attempt to analyze microarray data using manifold learning; given its performance, there should be much room for applying manifold learning to bioinformatics.

6.
Gene expression profile classification is a pivotal research domain assisting in the transformation from traditional to personalized medicine. A major challenge associated with gene expression data classification is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms to reduce the number of genes. Recent studies have been experimenting with the use of semantic similarity between genes in Gene Ontology (GO) as a method to improve feature selection. While there are few studies that discuss how to use GO for feature selection, there is no simulation study that addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation, which generates binary class datasets, where the differentially expressed genes between two classes have some underlying relationship in GO. This allows us to investigate the effects of various factors such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes denoted by δ, and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor for determining the efficacy of GO-based feature selection. In particular, as the connectedness of differentially expressed genes increases, the classification accuracy improvement increases. To quantify this notion of connectedness, we defined a measure called Biological Condition Annotation Level BCAL(G), where G is a graph of differentially expressed genes. 
Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) as the number of genes in a biological condition increases beyond 50 and δ ≥ 0.7, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), cases where δ is between 0.3 and 2.5, and training sample sizes between 20 and 200; therefore, our conclusions are limited to these specifications. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistical-based ranking measures.
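
The thresholds in conclusions (1), (2), (3) and (5) can be read as a simple decision rule. The sketch below is one hedged encoding, not something the abstract provides: the function name and the returned strings are hypothetical, the comparisons at 0.696 and 0.389 are assumed to be "at least" and "at most" (inferred from the strict inequalities in conclusion (3)), and conclusion (4) is left out because it describes a trend rather than a cut-off.

```python
def go_selection_advice(bcal, delta, n_genes):
    """Map BCAL(G), delta and gene count to the abstract's qualitative conclusions.

    All names and return strings are illustrative assumptions.
    """
    if n_genes < 10:
        return "avoid"        # conclusion (5): fewer than ten genes
    if bcal >= 0.696:
        return "helps"        # conclusion (1), assuming the boundary is inclusive
    if bcal <= 0.389:
        return "hurts"        # conclusion (2), assuming the boundary is inclusive
    if delta < 1:
        return "marginal"     # conclusion (3): middle BCAL range with delta < 1
    return "not stated"       # the abstract gives no conclusion for this case

advice = go_selection_advice(bcal=0.75, delta=0.5, n_genes=30)
```

As the abstract notes, any such rule only applies within the simulated ranges (δ between 0.3 and 2.5, 20 to 200 training samples, RMA-preprocessed data).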

7.
Selecting a subset of genes with strong discriminative power is a very important step in classification problems based on gene expression data. The Lasso and the Dantzig selector are known to perform automatic variable selection in linear regression analysis. This paper applies the Lasso and the Dantzig selector to select the most informative genes by representing the probability of an example being positive as a linear function of the gene expression data. The selected genes are then used to fit different classifiers for cancer classification. Comparative experiments were conducted on six publicly available cancer datasets, and the detailed comparison results show that, in general, the Lasso is more capable than the Dantzig selector at selecting informative genes for cancer classification.
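
The automatic variable selection attributed to the Lasso comes from the L1 penalty driving some coefficients exactly to zero. The following is a minimal coordinate-descent sketch under stated assumptions: it minimizes 0.5 * ||y - Xw||^2 + lam * ||w||_1 with no intercept, the labels are centered to -0.5 and +0.5 for convenience, and the function names and toy data are invented for illustration. It does not reproduce the paper's setup and does not cover the Dantzig selector.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for 0.5 * ||y - Xw||^2 + lam * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam) / z
    return w

# Toy data, invented: column 0 tracks the class label, column 1 is noise.
X = [[-1.0, 0.3], [-1.0, -0.1], [1.0, 0.2], [1.0, -0.2]]
y = [-0.5, -0.5, 0.5, 0.5]
w = lasso_cd(X, y, lam=0.5)
selected = [j for j, coef in enumerate(w) if coef != 0.0]
```

Genes whose coefficient stays at zero are dropped, and the remaining ones would be passed on to the downstream classifiers.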

8.
This study proposes an improved decision forest (IDF) with an integrated graphical user interface. On four gene expression data sets, the IDF not only outperforms the original decision forest but is also superior or comparable to other state-of-the-art machine learning methods, especially on high-dimensional data. With a built-in feature selection (FS) mechanism and fewer parameters to tune, it can be trained more efficiently than methods such as the support vector machine and can be built with far fewer trees than other popular tree-based ensemble methods. Moreover, it suffers less from the curse of dimensionality.

9.
Discovery of differentially expressed genes between normal and diseased patients is a central research problem in bioinformatics. It is especially important to find a few genetic markers that can be explored for diagnostic purposes. The performance of a set of markers is often measured by the associated classification accuracy. This motivates our ranking of genes by the minimum probability of classification error (MPE) for each gene. In this work, we use a Bayesian decision-making algorithm to compute the MPE. A quantile-based probability density estimation technique is used to generate the probability density functions of genes. The method is tested on three datasets: colon cancer, leukaemia, and hereditary breast cancer. The quality of the selected markers is evaluated by the classification accuracy obtained using a support vector machine and a modified naive Bayes classifier. We obtain 96.77% accuracy in colon cancer and 97.06% accuracy in leukaemia, using only five genes in each case. Finally, using just three genes, we get 100% accuracy in hereditary breast cancer. We also compare our results with those obtained using genes ranked by p-value and show that the genes ranked by MPE perform as well as or better than those ranked by p-value.

10.
11.
The field of gene expression data analysis has grown in the past few years from being purely data-centric to integrative, aiming at complementing microarray analysis with data and knowledge from diverse available sources. In this review, we report on the plethora of gene expression data mining techniques and focus on their evolution toward knowledge-based data analysis approaches. In particular, we discuss recent developments in gene expression-based analysis methods used in association and classification studies, phenotyping and reverse engineering of gene networks.

12.
With the development of bioinformatics, tumor classification from gene expression data has become an important and useful technology for cancer diagnosis. Since gene expression data often contain thousands of genes and only a small number of samples, gene selection is a key step in tumor classification. Attribute reduction based on rough sets has been successfully applied to gene selection because it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy data usually undergo discretization as a preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can deal with real-valued data while maintaining the original gene classification information. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. Using this measure can lead to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection is an effective method for improving the accuracy of tumor classification.

13.
Gene expression data represent nonlinear interactions among genes and environmental factors. Computational analysis of these data is expected to yield knowledge of gene functions and disease mechanisms. Clustering is a classical exploratory technique for discovering similar expression patterns and function modules. However, gene expression data are usually high-dimensional with relatively few samples, which is the main difficulty in applying clustering algorithms. Principal component analysis (PCA) is usually used to reduce the data dimensions for further clustering analysis. However, PCA estimates the similarity between expression profiles based on the Euclidean distance, which cannot reveal the nonlinear connections between genes. This paper uses nonlinear dimensionality reduction (NDR) as a preprocessing strategy for feature selection and visualization, and then applies clustering algorithms to the reduced feature spaces. To estimate the effectiveness of NDR for capturing biologically relevant structures, a comparative analysis of NDR and PCA is applied to five real cancer expression datasets. Results show that NDR can perform better than PCA in visualization and clustering analysis of complex gene expression data.

14.
Techniques for clustering gene expression data (total citations: 1; self-citations: 0; citations by others: 1)
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications which recognise these limitations and address them. As such, it provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.

15.
16.
Tumorigenesis is governed by a series of complex genetic and epigenetic changes. Both mechanisms can result in either the silencing or aberrant expression of messages in a cell. Gene expression profiling techniques such as the serial analysis of gene expression (SAGE) or microarray analysis can provide global overviews of these changes, as well as identify key genes and pathways involved in this process. This review outlines the current roles of these techniques in cancer research, and how they may contribute to finding not only mechanisms of this disease, but also potential targets for therapy.

17.
18.
Since Golub applied gene expression profiles (GEP) to the molecular classification of tumor subtypes for more accurate and reliable clinical diagnosis, a number of studies on GEP-based tumor classification have been carried out. However, the challenges posed by the high dimensionality and small sample size of tumor datasets remain. This paper presents a new tumor classification approach based on an ensemble of probabilistic neural networks (PNN) and gene reduction based on the neighborhood rough set model. Informative genes were initially selected by gene ranking based on an iterative search margin algorithm and were then further refined by gene reduction to select many minimum gene subsets. Finally, the candidate base PNN classifiers, each trained on one of the selected gene subsets, were integrated by a majority voting strategy to construct an ensemble classifier. Experiments on tumor datasets showed that this approach obtains high and stable classification performance, is not too sensitive to the number of initially selected genes, and is competitive with most existing methods. Additionally, the classification results can be cross-verified in a single biomedical experiment by the selected gene subsets, and biological experimental results also showed that the genes included in the selected gene subsets are functionally related to carcinogenesis, indicating that the performance obtained by the proposed method is convincing.
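
The majority voting step that combines the base classifiers can be sketched as follows. This is a minimal illustration of plain majority voting only: the base predictions are hard-coded stand-ins (no PNN or gene reduction is implemented here), and the function name, labels, and tie-breaking rule are assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions is a list with one entry per base classifier; each entry is a
    list of predicted labels of equal length. Ties go to the label seen first.
    """
    voted = []
    for sample_votes in zip(*predictions):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted

# Stand-in outputs of three base classifiers on three samples (invented values).
base_outputs = [
    ["tumor", "normal", "tumor"],
    ["tumor", "tumor", "normal"],
    ["normal", "normal", "tumor"],
]
ensemble = majority_vote(base_outputs)
```

Because each base classifier is trained on a different gene subset, an error made by one of them can be outvoted by the others, which is one plausible reading of the stability the abstract reports.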

19.
Purpose: Adjuvant chemotherapy (ACT) is used after surgery to prevent recurrence or metastases. However, ACT for non-small cell lung cancer (NSCLC) is still controversial. This study aimed to develop prediction models to distinguish patients who are suitable for ACT (ACT-benefit) from those who should avoid ACT (ACT-futile) in NSCLC. Methods: We identified ACT-correlated gene signatures and ran several types of artificial neural network (ANN) algorithms to construct the optimal ANN architecture for ACT-benefit classification. Reliability was assessed by cross-data set validation. Results: Two probes (two genes) combined with T-stage clinical data gave a good prediction result. These genes were 208893_s_at (DUSP6) and 204891_s_at (LCK). The 10-fold cross-validation classification accuracy was 65.71%. The best ANN model was MLP14-8-2 with a logistic activation function. Conclusions: Using gene signature profiles to predict ACT benefit in NSCLC is feasible. The key to this analysis was identifying the pertinent genes and classification. This study may help reduce ineffective medical practices and avoid wasting medical resources.

20.
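Objective: Gene chip (microarray) technology has had a revolutionary impact on the development of clinical diagnosis, treatment, and drug development and screening. Because high-dimensional medical data are difficult to reduce, and because gene expression profile data have few samples, high dimensionality, and heavy noise, dimensionality reduction is essential. Using principal component analysis (PCA) and linear discriminant analysis (LDA), this work addresses the classification of gene expression profile data and improves the recognition rate. Methods: PCA and LDA were each applied to reduce the dimensionality of gene expression profile data, a K-nearest neighbor (KNN) classifier was then used to classify the data, and the approach was applied separately to breast cancer and ovarian cancer mass spectrometry data. Results: On the two cancer mass spectrometry datasets, PCA and LDA effectively extracted discriminative feature information and greatly reduced the dimensionality of the medical data while maintaining high classification accuracy. Conclusion: Classifying cancer gene expression profile data with dimensionality reduction can help clinicians discover new disease features and improve the accuracy of disease diagnosis.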
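
The K-nearest neighbor (KNN) classification step applied after dimensionality reduction can be sketched as below. This is a minimal illustration under stated assumptions: the PCA or LDA projection is not implemented, so the two-dimensional points stand in for already reduced features, and the function name, labels, choice of k, and toy values are all invented.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority label among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = [label for _, label in dists[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Stand-in for samples already projected to two components (invented values).
train_x = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (2.0, 2.1), (2.2, 1.9), (1.8, 2.3)]
train_y = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]

pred_a = knn_predict(train_x, train_y, (0.2, 0.3))
pred_b = knn_predict(train_x, train_y, (2.1, 2.0))
```

Distances are more meaningful in the reduced space than in the original thousands of dimensions, which is why the reduction is applied before the KNN step.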
