Similar articles
 18 similar articles found (search time: 15 ms)
1.
Since Golub applied gene expression profiles (GEP) to the molecular classification of tumor subtypes for more accurate and reliable clinical diagnosis, a number of studies on GEP-based tumor classification have been carried out. However, the challenges posed by the high dimensionality and small sample size of tumor datasets remain. This paper presents a new tumor classification approach based on an ensemble of probabilistic neural networks (PNN) and gene reduction based on the neighborhood rough set model. Informative genes were initially selected by gene ranking based on an iterative search margin algorithm and were then further refined by gene reduction to select many minimum gene subsets. Finally, the candidate base PNN classifiers, each trained on one of the selected gene subsets, were integrated by a majority voting strategy to construct an ensemble classifier. Experiments on tumor datasets showed that this approach obtains both high and stable classification performance, is not overly sensitive to the number of initially selected genes, and is competitive with most existing methods. Additionally, the classification results can be cross-verified in a single biomedical experiment using the selected gene subsets, and biological experimental results also showed that the genes included in the selected gene subsets are functionally related to carcinogenesis, indicating that the performance obtained by the proposed method is convincing.
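The final integration step of abstract 1, majority voting over base classifiers, can be illustrated with a minimal sketch. It assumes each base classifier has already produced one label per sample; the function name `majority_vote` and the tie-breaking rule (the first label in voting order that reaches the top count) are illustrative assumptions, not details taken from the paper, and no PNN is implemented here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one ensemble label per sample.

    predictions: list of lists; predictions[c][i] is the label that base
    classifier c assigns to sample i. All inner lists must have equal length.
    """
    if not predictions:
        raise ValueError("need at least one base classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must label the same samples")

    ensemble = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Tie-break (assumption): first label in voting order with the top count.
        winner = next(label for label in sample_votes if counts[label] == top)
        ensemble.append(winner)
    return ensemble


# Three hypothetical base classifiers voting on three samples.
votes = [
    ["ALL", "AML", "ALL"],
    ["ALL", "ALL", "AML"],
    ["AML", "AML", "ALL"],
]
print(majority_vote(votes))
```

An odd number of base classifiers avoids ties in two-class problems, so the tie-breaking rule only matters for even ensembles or more than two classes.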

2.
Tumor classification is an important application domain of gene expression data. Because such data are high-dimensional, have a small sample size (SSS), and contain a great number of redundant genes unrelated to tumor phenotypes, various feature extraction and gene selection methods have been applied to gene expression data analysis. Wavelet packet transforms (WPT) and neighborhood rough sets (NRS) are effective tools for extracting and selecting features. In this paper, a novel approach to tumor classification based on WPT and NRS is proposed. First, the classification features are extracted by WPT and the decision tables are formed. Second, the attributes of the decision tables are reduced by NRS. Finally, a feature subset with few attributes and high classification ability is obtained. The experimental results on three gene expression datasets demonstrate that the proposed method is effective and feasible.

3.
4.
Gene expression datasets are a means to classify and predict the diagnostic category of a patient. Selecting informative genes and representative samples are two important aspects of reducing gene expression data. Identifying and pruning redundant genes and samples simultaneously can improve classification performance and circumvent the local optima problem. In the present paper, a modified particle swarm optimization was applied to select optimal genes and samples simultaneously, and a support vector machine was used as the objective function to determine the optimum set of genes and samples. To evaluate its performance, the proposed method was applied to three publicly available microarray datasets. The results demonstrate that the proposed method for gene and sample selection is a useful tool for mining high-dimensional data.

5.
Due to recent advances in DNA microarray technology, the diagnostic category of tissue samples can be predicted with high accuracy using gene expression profiles. In this study, we discuss shortcomings of some existing gene expression profile classification methods and propose a new approach based on linear Bayesian classifiers. In our approach, we first construct gene-level linear classifiers to identify genes that provide high class-prediction accuracies, i.e., low error rates. After this screening phase, starting with the gene that offers the lowest error rate, we construct a multi-dimensional linear classifier by incorporating the next best-performing genes until the prediction error reaches a minimum, or 0 if possible. When we compared the classification performance of our approach against approaches based on prediction analysis of microarrays (PAM) and support vector machines (SVM), we found that our method outperforms PAM and produces results comparable with SVM. In addition, we observed that the gene selection scheme of PAM could be misleading. Although SVM achieves relatively higher prediction performance, it has two major disadvantages: complexity and lack of insight about important genes. Our intuitive approach offers competitive performance and also an efficient means for finding important genes.

6.
Gene selection is an important task in bioinformatics studies, because the accuracy of cancer classification generally depends on genes that are biologically relevant to the classification problem. In this work, a randomization test (RT) is used as a gene selection method for gene expression data. In the method, a statistic derived from the statistics of the regression coefficients in a series of partial least squares discriminant analysis (PLSDA) models is used to evaluate the significance of the genes. Informative genes are selected for classifying four gene expression datasets (prostate cancer, lung cancer, leukemia, and non-small cell lung cancer (NSCLC)), and the rationality of the results is validated by multiple linear regression (MLR) modeling and principal component analysis (PCA). With the selected genes, satisfactory results can be obtained.

7.
Gene selection from high-dimensional microarray gene-expression data is a statistically challenging problem. Filter approaches to gene selection have been popular because of their simplicity, efficiency, and accuracy. Because of the small sample size, all samples are generally used to compute the relevant ranking statistics, and the selection of samples in filter-based gene selection methods has not been addressed. In this paper, we extend a previously proposed simultaneous sample and gene selection approach. In a backward elimination method, a modified logistic regression loss function is used to select relevant samples at each iteration, and these samples are used to compute the T-score to rank genes. This method provides a compromise between the T-score and other support vector machine (SVM) based algorithms. The performance is demonstrated on both simulated and real datasets with criteria such as classification performance, stability, and redundancy. Results indicate that the computational complexity and stability of the method are improved compared with SVM-based methods without compromising classification performance.
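As an illustration of the T-score ranking step mentioned in abstract 7, the following minimal sketch ranks genes by the absolute value of a two-sample t-statistic. The Welch (unequal-variance) form is an assumption, since the abstract does not say which variant is used, and the sample selection via the modified logistic regression loss is not implemented. The names `t_score` and `rank_genes` and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def t_score(a, b):
    """Welch two-sample t-statistic for one gene (each group needs >= 2 values)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Degenerate case: no within-group spread at all.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def rank_genes(expr, labels):
    """Return gene names sorted by decreasing |t|.

    expr: dict mapping gene name -> list of expression values, one per sample.
    labels: list of 0/1 class labels aligned with the sample order.
    """
    scores = {}
    for gene, values in expr.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(t_score(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: g1 separates the classes, g2 does not.
expr = {
    "g1": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
}
labels = [0, 0, 0, 1, 1, 1]
print(rank_genes(expr, labels))
```

In a backward elimination loop of the kind the abstract describes, `labels` and `expr` would be restricted to the currently selected samples before each ranking pass.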

8.
Gene selection is important for cancer classification based on gene expression data because of the high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used in the process, which makes it possible to learn the gene weights during clustering and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets using two classifiers (support vector machine and k-nearest neighbor) and compared it with six other competitive feature selectors. The results show that the proposed algorithm is capable of achieving better accuracies and may be an efficient tool for finding possible biomarkers from gene expression data.

9.
Gene expression profile classification is a pivotal research domain assisting in the transformation from traditional to personalized medicine. A major challenge associated with gene expression data classification is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms to reduce the number of genes. Recent studies have been experimenting with the use of semantic similarity between genes in Gene Ontology (GO) as a method to improve feature selection. While there are few studies that discuss how to use GO for feature selection, there is no simulation study that addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation, which generates binary class datasets, where the differentially expressed genes between two classes have some underlying relationship in GO. This allows us to investigate the effects of various factors such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes denoted by δ, and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor for determining the efficacy of GO-based feature selection. In particular, as the connectedness of differentially expressed genes increases, the classification accuracy improvement increases. To quantify this notion of connectedness, we defined a measure called Biological Condition Annotation Level BCAL(G), where G is a graph of differentially expressed genes. 
Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) as the number of genes in a biological condition increases beyond 50 and δ  0.7, the improvement from GO-based feature selection decreases; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), cases where δ is between 0.3 and 2.5, and training sample sizes between 20 and 200; our conclusions are therefore limited to these specifications. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistical-based ranking measures.
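Conclusions (1), (2), (3) and (5) of abstract 9 amount to a small decision rule, sketched below. This is an illustrative reading, not the authors' code: the function name `go_selection_advice` and the returned strings are hypothetical, the boundary values 0.696 and 0.389 are treated as inclusive (an assumption), conclusion (4) is left out, and cases the abstract does not cover return "not stated".

```python
def go_selection_advice(bcal, delta, n_genes):
    """Map BCAL(G), delta and gene count to the abstract's stated conclusions.

    bcal: Biological Condition Annotation Level of the gene graph G.
    delta: mean magnitude of separation between differentially expressed genes.
    n_genes: number of genes in the biological condition.
    """
    if n_genes < 10:
        return "not recommended"          # conclusion (5)
    if bcal >= 0.696:                     # inclusive bound is an assumption
        return "increases accuracy"       # conclusion (1)
    if bcal <= 0.389:                     # inclusive bound is an assumption
        return "decreases accuracy"       # conclusion (2)
    if delta < 1:
        return "marginal improvement"     # conclusion (3)
    return "not stated"                   # outside the abstract's conclusions


print(go_selection_advice(bcal=0.75, delta=0.5, n_genes=40))
print(go_selection_advice(bcal=0.50, delta=0.5, n_genes=40))
```

Because the study covers only δ between 0.3 and 2.5 and training sample sizes between 20 and 200, a rule like this should not be applied outside those ranges.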

10.
Gene expression data collected from DNA microarrays are characterized by a large number of variables (genes) but only a small number of observations (experiments). In this paper, a manifold learning method is proposed to map the gene expression data to a low-dimensional space and then explore the intrinsic structure of the features so as to classify the microarray data more accurately. The proposed algorithm can project the gene expression data into a subspace with high intra-class compactness and inter-class separability. Experimental results on six DNA microarray datasets demonstrate that the method is efficient for discriminant feature extraction and gene expression data classification. This work is a meaningful attempt to analyze microarray data using manifold learning; given its performance, there should be much room for applying manifold learning to bioinformatics.

11.
Selecting a subset of genes with strong discriminative power is a very important step in classification problems based on gene expression data. The Lasso and the Dantzig selector are known to perform automatic variable selection in linear regression analysis. This paper applies the Lasso and the Dantzig selector to select the most informative genes by representing the probability of an example being positive as a linear function of the gene expression data. The selected genes are then used to fit different classifiers for cancer classification. Comparative experiments were conducted on six publicly available cancer datasets, and the detailed comparison results show that, in general, the Lasso is more capable than the Dantzig selector of selecting informative genes for cancer classification.
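The variable selection behaviour of the Lasso described in abstract 11 can be sketched with a small coordinate-descent solver. This is a generic textbook formulation, minimizing (1/2n)·||y − Xw||² + λ·||w||₁ with no intercept, and is not the paper's implementation; it assumes the columns of X and the response y are already centered. The function names and toy data are hypothetical, and the Dantzig selector is not covered.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # constant-zero column carries no information
            rho = 0.0
            for i in range(n):
                # Prediction using every coefficient except w[j].
                pred_wo_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
            w[j] = soft_threshold(rho / n, lam) / z
    return w


# Toy data: labels coded as +1/-1 depend only on the first "gene".
X = [[1.0, 0.1, -0.2], [-1.0, 0.2, 0.1], [1.0, -0.1, 0.2], [-1.0, -0.2, -0.1]]
y = [1.0, -1.0, 1.0, -1.0]
w = lasso_cd(X, y, lam=0.1)
selected = [j for j, coef in enumerate(w) if coef != 0.0]
print(w, selected)
```

Genes whose coefficients are shrunk exactly to zero are dropped, which is what makes the L1 penalty act as a selector; a larger `lam` keeps fewer genes.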

12.
13.
Genomewide profiling of gene expression, made possible by the development of DNA microarray technology and made more powerful by the sequencing of the human genome, has led to advances in tumor classification and biomarker discovery for the common types of human neoplasia. Application of this approach to the field of endocrine neoplasia is in its infancy, although some progress has recently been reported. In this review, the progress to date is summarized, and the promise of DNA microarray analysis, in conjunction with tissue array immunohistochemistry, to significantly impact endocrine tumor diagnosis and prognosis is discussed.

14.
Our main interest in supervised classification of gene expression data is to infer whether the expressions can discriminate biological characteristics of samples. With thousands of gene expressions to consider, gene selection has been advocated to decrease classification error by including only the discriminating genes. We propose to base the gene selection on partial least squares (PLS) and logistic regression random-effects (RE) estimates before the selected genes are evaluated in classification models. We compare this selection with selection based on the two-sample t-statistics, a current practice, and on modified t-statistics. The results indicate that gene selection based on logistic regression RE estimates is recommended in general, while selection based on the PLS estimates is recommended when the number of samples is low. Gene selection based on the modified t-statistics performs well when the genes exhibit moderate-to-high variability with moderate group separation. Respecting the characteristics of the data is a key aspect to consider in gene selection.

15.
The purpose of this paper is to evaluate the effect of combining magnetic resonance spectroscopic imaging (MRSI) data and magnetic resonance imaging (MRI) data on the classification result for four brain tumor classes. Suppressed and unsuppressed short echo time MRSI and MRI were performed on 24 patients with a brain tumor and four volunteers. Four different feature reduction procedures were applied to the MRSI data: simple quantitation, principal component analysis, independent component analysis, and LCModel. Water intensities were calculated from the unsuppressed MRSI data. Features were extracted from the MR images, which were acquired with four different contrasts, to comply with the spatial resolution of the MRSI. Evaluation was performed by investigating different combinations of the MRSI features, the MRI features, and the water intensities. For each data set, the isolation in feature space of the tumor classes, healthy brain tissue, and cerebrospinal fluid was calculated and visualized. A test set was used to calculate classification results for each data set. Finally, the effect of the selected feature reduction procedures on the MRSI data was investigated to ascertain whether it was more important than the addition of MRI information. The conclusions are that combining features from MRSI data and MRI data improves the classification result considerably compared with features obtained from MRSI data alone. This effect is larger than the effect of specific feature reduction procedures on the MRSI data. Adding water intensities to the data set also improves the classification result, although not significantly. We show that combining data from different MR investigations can be very important for brain tumor classification, particularly if a large number of tumors are to be classified simultaneously.

16.
The term pap-smear refers to samples of human cells stained by the so-called Papanicolaou method. The purpose of the Papanicolaou method is to diagnose pre-cancerous cell changes before they progress to invasive carcinoma. In this paper, a metaheuristic algorithm is proposed to classify the cells. Two databases are used, constructed at different times by expert MDs and consisting of 917 and 500 images of pap-smear cells, respectively. Each cell is described by 20 numerical features, and the cells fall into 7 classes, but a minimal requirement is to separate normal from abnormal cells, which is a 2-class problem. To find the best-performing feature subset, an effective genetic algorithm scheme is proposed. This algorithmic scheme is combined with a number of nearest neighbor based classifiers. Results show that the resulting classification accuracy generally exceeds that of previously applied intelligent approaches.

17.
For each cancer type, only a few genes are informative. Due to the so-called ‘curse of dimensionality’ problem, the gene selection task remains a challenge. To overcome this problem, we propose a two-stage gene selection method called MRMR-COA-HS. In the first stage, minimum redundancy and maximum relevance (MRMR) feature selection is used to select a subset of relevant genes. The selected genes are then fed into a wrapper setup that combines a new algorithm, COA-HS, with a support vector machine classifier. The method was applied to four microarray datasets, and the performance was assessed by leave-one-out cross-validation. Comparative performance assessment against other evolutionary algorithms suggested that the proposed algorithm significantly outperforms other methods in selecting fewer genes while maintaining the highest classification accuracy. The functions of the selected genes were further investigated, and it was confirmed that the selected genes are biologically relevant to each cancer type.
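The leave-one-out cross-validation used for assessment in abstract 17 can be sketched as follows. To keep the example self-contained, a 1-nearest-neighbor rule stands in for the support vector machine named in the abstract; that substitution, the function names, and the toy data are all assumptions made for illustration, and neither MRMR nor COA-HS is implemented.

```python
def nearest_neighbor(train_x, train_y, x):
    """Label x with the class of its closest training sample (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[dists.index(min(dists))]


def loocv_accuracy(samples, labels, classify):
    """Hold out each sample once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy data: one selected "gene" that separates two classes.
samples = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1]
print(loocv_accuracy(samples, labels, nearest_neighbor))
```

In a wrapper setup, a score of this kind would be computed for each candidate gene subset, with `samples` restricted to the columns of that subset.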

18.