Similar literature
 20 similar documents found (search time: 332 ms)
1.
Microarray analysis is widely accepted for human cancer diagnosis and classification. However, the high dimensionality of microarray data poses a great challenge to classification. Gene selection plays a key role in identifying, from the thousands of genes in microarray data, the salient genes that can directly contribute to the symptoms of disease. Although various excellent selection methods are currently available, they share one common problem: genes that have strong discriminatory power as a group but are weak as individuals will be discarded. In this paper, a new gene selection method is proposed for cancer diagnosis and classification that retains useful intrinsic groups of interdependent genes. The primary characteristic of this method is that the relevance between each gene and the target is dynamically updated whenever a new gene is selected. The effectiveness of the method is validated by experiments on six publicly available microarray data sets. Experimental results show that the classification performance and enrichment score achieved by the proposed method are better than those of other selection methods.

2.
OBJECTIVE: Microarray experiments provide an unprecedented amount of data. A typical microarray data set for ovarian cancer consists of the expression levels of tens of thousands of genes on a genomic scale, and there is no systematic procedure for analyzing this information instantaneously. To avoid high computational complexity, it is necessary to select the gene markers most likely to be differentially expressed in order to explain the effects of ovarian cancer. Traditionally, gene markers are selected by ranking genes according to statistics or machine learning algorithms. In this paper, an integrated algorithm is derived for gene selection and classification in microarray data of ovarian cancer. METHODS: First, regression analysis is applied to find target genes. Genetic algorithm (GA), particle swarm optimization (PSO), support vector machine (SVM), and analysis of variance (ANOVA) are hybridized to select gene markers from the target genes. Finally, the improved fuzzy model is applied to classify cancer tissues. RESULTS: The microarray data of ovarian cancer, obtained from China Medical University Hospital, are used to test the performance of the proposed algorithm. In simulation, 200 target genes are obtained after regression analysis, and six gene markers are selected by the hybrid process of GA, PSO, SVM, and ANOVA. These gene markers are then used to classify cancer tissues. CONCLUSIONS: The proposed algorithm can be used to analyze gene expression and has superior performance on microarray data of ovarian cancer, and it can be applied in other studies for cancer diagnosis.

3.
Microarray-based gene expression profiling has emerged as an efficient technique for classification, prognosis, diagnosis, and treatment of cancer. Frequent changes in the behavior of this disease generate an enormous volume of data. Microarray data satisfies both the veracity and velocity properties of big data, as it keeps changing with time. Therefore, the analysis of microarray datasets in a small amount of time is essential. These datasets often contain a large amount of expression data, but only a fraction of it comes from genes that are significantly expressed. The precise identification of the genes of interest that are responsible for causing cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process in which feature selection or extraction is followed by classification. In this paper, various statistical methods (tests) based on MapReduce are proposed for selecting relevant features. After feature selection, a MapReduce-based K-nearest neighbor (mrKNN) classifier is employed to classify the microarray data. These algorithms are successfully implemented in a Hadoop framework. A comparative analysis of these MapReduce-based models is carried out using microarray datasets of various dimensions. The results show that these models consume much less execution time than conventional models when processing big data.

4.
With advances in microarray technology, many biomarker selection approaches have been proposed for cancer diagnosis. Marker sets are selected by scoring genes for how well they can discriminate between different classes of diseases [1-4] or are ranked by significance analysis without reference to classification tasks. However, there is a pressing need for methods that integrate prior biological knowledge into the gene selection process. In this study, we propose to identify genes primarily in terms of their relevance to diagnostic outcome. Because gene expression is a combination effect, the microarray data is decomposed with the help of SVD, and the eigenvectors corresponding to the biological effect of clinical outcomes are identified. Genes that play important roles in determining this biological effect are then detected. Genes are therefore identified essentially in terms of the strength of their association with clinical outcomes, and the relationship between genes and clinical outcomes is analyzed. Monte Carlo simulations are then used to fine-tune the selected gene set in terms of classification accuracy. The approach was tested on four public data sets. Comparative studies show that the selected genes achieve higher classification accuracies. Graphical analysis shows that they have a close relationship with the cancer class. Statistical simulation shows that the gene set found by the proposed method is also less variable and comparatively invariant to external influences. The biological relevance of the selected genes is further discussed and validated through a literature study and analysis of biological databases.

5.
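Application of the relevance vector machine to tumor expression profile classification   Total citations: 1 (self-citations: 0, citations by others: 1)
Gene chip technology can measure the expression levels of a large number of genes and is increasingly widely used in tumor research. Tumor classification and diagnosis based on microarray expression profiles is a hot topic in tumor expression profile research. Tumor expression profile classification is a typical high-dimensional, small-sample classification problem, and this paper describes a two-step classification method. Because the measured gene expression profiles contain a large number of redundant, non-differentially expressed genes, an effective gene preselection strategy is first used to obtain a smaller candidate gene subset, and a classification and prediction model based on the relevance vector machine is then built. On four real tumor expression profile data sets, comparison with several other methods shows that the method achieves better classification accuracy and also exhibits good stability.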

6.
In microarray-based cancer classification and prediction, gene selection is an important research problem owing to the large number of genes and the small number of experimental conditions. In this paper, we propose a Bayesian approach to gene selection and classification using the logistic regression model. The basic idea of our approach is to use a logistic regression model to relate gene expression to the class labels. We use Gibbs sampling and Markov chain Monte Carlo (MCMC) methods to discover important genes. To implement the Gibbs sampler and the MCMC search, we derive a posterior distribution of the selected genes given the observed data. After the important genes are identified, the same logistic regression model is used for cancer classification and prediction. Issues concerning the efficient implementation of the proposed method are discussed. The proposed method is evaluated on several large microarray data sets, including hereditary breast cancer, small round blue-cell tumors, and acute leukemia. The results show that the method can effectively identify important genes consistent with known biological findings, while the classification accuracy is also high. Finally, the robustness and sensitivity properties of the proposed method are investigated.

7.
Missing values in microarray data can significantly affect subsequent analysis; thus, it is important to estimate these missing values accurately. In this paper, a sequential local least squares imputation (SLLSimpute) method is proposed to solve this problem. It estimates missing values sequentially, starting from the gene containing the fewest missing values, and partially utilizes these estimated values in later estimates. In addition, an automatic parameter selection algorithm, which can generate an appropriate number of neighboring genes for each target gene, is presented for parameter estimation. Experimental results confirm that the SLLSimpute method exhibits better estimation ability than other currently used imputation methods.

8.
For each cancer type, only a few genes are informative. Due to the so-called ‘curse of dimensionality’ problem, the gene selection task remains a challenge. To overcome this problem, we propose a two-stage gene selection method called MRMR-COA-HS. In the first stage, minimum redundancy and maximum relevance (MRMR) feature selection is used to select a subset of relevant genes. The selected genes are then fed into a wrapper setup that combines a new algorithm, COA-HS, with a support vector machine as the classifier. The method was applied to four microarray datasets, and its performance was assessed by leave-one-out cross-validation. Comparison of the proposed method with other evolutionary algorithms suggests that it significantly outperforms them, selecting fewer genes while maintaining the highest classification accuracy. The functions of the selected genes were further investigated, and it was confirmed that the selected genes are biologically relevant to each cancer type.
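The first stage described in this abstract can be illustrated with a small greedy sketch. It is not the authors' implementation: MRMR is usually defined with mutual information, whereas this sketch substitutes the absolute Pearson correlation as a stand-in for both relevance and redundancy, and the names `pearson` and `mrmr_select` are hypothetical. The COA-HS wrapper stage is not covered.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def mrmr_select(genes, labels, k):
    """Greedily pick up to k gene names by relevance minus mean redundancy.

    genes:  dict mapping gene name -> list of expression values (one per sample)
    labels: list of numeric class labels (one per sample)
    Assumption: |Pearson correlation| replaces the mutual information of MRMR.
    """
    relevance = {g: abs(pearson(v, labels)) for g, v in genes.items()}
    selected = []
    remaining = set(genes)

    def score(g):
        # Relevance to the class, penalised by average similarity to chosen genes.
        if not selected:
            return relevance[g]
        redundancy = sum(abs(pearson(genes[g], genes[s])) for s in selected)
        return relevance[g] - redundancy / len(selected)

    while remaining and len(selected) < k:
        # Sorting makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With this scoring, a gene that duplicates an already selected gene is penalised heavily, so a weaker but complementary gene can be preferred over it.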

9.
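A gene selection method based on SVM-RFE-SFS   Total citations: 2 (self-citations: 0, citations by others: 2)
Gene microarray data usually contain a large amount of data irrelevant to tumor classification, which severely reduces the accuracy of tumor diagnosis. Such data also suffer from small sample sizes and high dimensionality, which further increases the difficulty of tumor diagnosis, so gene selection is necessary. A new gene selection method is proposed that combines the support vector machine (SVM), recursive feature elimination (RFE), and sequential forward selection (SFS). First, the SVM is used to compute a ranking criterion score for each gene, and the first-order differences of the ranking criterion scores are then used to partition the genes into several groups. Recursive feature elimination is applied to the group with the smallest ranking criterion scores to remove noisy genes, while sequential forward selection is applied to the group with the largest ranking criterion scores to select informative genes. Experimental results on leukemia, colon cancer, and breast cancer microarray data show that the proposed method runs efficiently and classifies well.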
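The grouping step in this abstract (partitioning genes by first-order differences of their ranking scores) can be sketched as follows. The abstract does not say how the cut points are chosen, so this sketch assumes the cuts fall at the largest gaps between neighbouring scores; the function name `group_by_score_gaps` and the `n_groups` parameter are hypothetical, and the SVM ranking scores are assumed to have been computed elsewhere.

```python
def group_by_score_gaps(scores, n_groups):
    """Split genes into contiguous groups of the descending score ranking.

    scores:   dict mapping gene name -> ranking criterion score
    n_groups: desired number of groups (assumed parameter, not from the source)
    Cuts are placed at the largest first-order differences between neighbours.
    """
    # Rank genes from highest to lowest score; names break ties deterministically.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    if n_groups <= 1 or len(ranked) <= 1:
        return [ranked]

    # First-order difference between each gene and the next one in the ranking.
    gaps = [(scores[ranked[i]] - scores[ranked[i + 1]], i)
            for i in range(len(ranked) - 1)]

    # Keep the positions of the (n_groups - 1) largest gaps, in ranking order.
    largest = sorted(gaps, key=lambda t: (-t[0], t[1]))[: n_groups - 1]
    cuts = sorted(i for _, i in largest)

    groups, start = [], 0
    for i in cuts:
        groups.append(ranked[start:i + 1])
        start = i + 1
    groups.append(ranked[start:])
    return groups
```

Under this assumption, the first group returned would be the candidate set for sequential forward selection and the last group the candidate set for recursive feature elimination.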

10.
Objective: The small sample size of functional genomics experiments has made it necessary to integrate DNA microarray experimental data from different sources. However, the experimental noise and biases of different microarray platforms have made integrated data analysis challenging. In this work, we propose an integrative computational framework to identify candidate biomarker genes from publicly available functional genomics studies. Methods: We developed a new framework, Gaussian Mixture Modeling-Coupled Information Gain (GMM-IG). In this framework, we first apply a two-component Gaussian mixture model (GMM) to estimate the conditional probability distributions of gene expression data between two different types of samples, for example, normal versus cancer. An expectation-maximization algorithm is then used to estimate the maximum likelihood parameters of a mixture of two Gaussian models in the feature space and to determine the underlying expression levels of genes. Gene expression results from different studies are discretized based on the GMM estimates and then unified. Significantly differentially expressed genes are filtered and assessed with information gain (IG) measures. Results: DNA microarray experimental data for lung cancers from three different prior studies were processed using the new GMM-IG method. Target gene markers from a gene expression panel were selected and compared with those from several conventional computational biomarker data analysis methods. GMM-IG showed consistently high accuracy in several classification assessments. High reproducibility of the gene selection results was also established by statistical validation. Our study shows that the GMM-IG framework can overcome the poor reliability of single-study DNA microarray experiments while maintaining high accuracy by combining true signals from multiple studies. Conclusions: We present a conceptually simple framework that enables reliable integration of true differential gene expression signals from multiple microarray experiments. This novel computational method has been shown to generate interesting biomarker panels for lung cancer studies. It is promising as a general strategy for future panel biomarker development, especially for applications that require integrating experimental results generated by different research centers or with different technology platforms.
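The mixture-model step in this abstract can be illustrated with a minimal one-dimensional sketch: a two-component Gaussian mixture fitted by expectation-maximization, followed by discretization of each value to its more likely component. The initialization at the minimum and maximum value, the variance floor `min_var`, the fixed iteration count, and the function names are all assumptions made for illustration; the information gain filtering and the cross-study unification are not shown.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def fit_two_component_gmm(values, n_iter=100, min_var=1e-6):
    """Fit a 1-D two-component Gaussian mixture by EM.

    Returns (weights, means, variances); component 0 has the lower mean.
    Assumed choices: means start at min/max, variances at the overall variance.
    """
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = max(sum((v - overall_mean) ** 2 for v in values) / n, min_var)
    weights = [0.5, 0.5]
    means = [min(values), max(values)]
    variances = [overall_var, overall_var]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for v in values:
            p = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in (0, 1)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, min_var)

    if means[0] > means[1]:
        weights.reverse()
        means.reverse()
        variances.reverse()
    return weights, means, variances


def discretize(values, weights, means, variances):
    """Label a value 1 if the higher-mean component is more likely, else 0."""
    labels = []
    for v in values:
        low = weights[0] * normal_pdf(v, means[0], variances[0])
        high = weights[1] * normal_pdf(v, means[1], variances[1])
        labels.append(1 if high > low else 0)
    return labels
```

Discretizing to component labels is one plausible way to put expression values from different platforms on a common scale before they are combined.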

11.
Classifiers have been widely used to select an optimal subset of feature genes from microarray data for accurate classification of cancer samples and for cancer-related studies. However, the classification rules derived from most classifiers are complex and difficult to interpret biologically. Solving this problem is a new challenge. In this paper, a new classification model based on gene pairs is proposed to address it. Experimental results on several microarray data sets demonstrate that the proposed classification model performs well in finding a large number of excellent feature gene pairs. A 100% LOOCV classification accuracy can be achieved either with a single classification model based on the optimal feature gene pair or by combining multiple top-ranked classification models. Using the proposed method, we successfully identified important cancer-related genes that had been validated in previous biological studies but were not discovered by the other methods.

12.
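Gene selection algorithms are among the most important methods for supporting biological analysis, but such statistical algorithms are hampered by sample sizes that are too small relative to the number of genes. This paper proposes a gene selection algorithm that incorporates Gene Ontology (GO) annotation information: each gene's variance is corrected using a weighted average of the variances of genes with similar GO annotations, which strengthens the estimate of the population under small sample sizes and thereby helps identify differentially expressed genes. The algorithm is compared with five other common algorithms by building classifiers that use the selected genes as features and taking classifier reliability as the evaluation criterion. Results on three sets of microarray experiments show that the algorithm has some advantage in small-sample settings. Literature in PubMed also provides evidence that the algorithm can identify disease-causing genes that other algorithms have not found. The framework established by this method is an effective attempt to bring biological annotation information into algorithm improvement.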
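The variance correction described in this abstract can be sketched in a few lines. The abstract does not give the exact formula, so this sketch assumes a simple convex mix between a gene's own variance and the weighted average variance of its GO-similar genes; the mixing parameter `alpha`, the similarity weights, and the function name `go_adjusted_variance` are hypothetical.

```python
def go_adjusted_variance(own_var, neighbour_vars, neighbour_weights, alpha=0.5):
    """Correct a gene's variance with the variances of GO-similar genes.

    own_var:           sample variance of the gene itself
    neighbour_vars:    variances of genes with similar GO annotations
    neighbour_weights: non-negative similarity weights, one per neighbour
    alpha:             assumed mixing weight given to the gene's own variance
    """
    total_weight = sum(neighbour_weights)
    if not neighbour_vars or total_weight == 0:
        # No usable annotation neighbours: fall back to the gene's own estimate.
        return own_var
    # Weighted average of the neighbours' variances.
    pooled = sum(w * v for w, v in zip(neighbour_weights, neighbour_vars)) / total_weight
    return alpha * own_var + (1 - alpha) * pooled
```

The corrected variance could then replace the raw variance in a t-like statistic, which is one way the small-sample estimate might be stabilised.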

13.
Gene selection is important for cancer classification based on gene expression data because of the high dimensionality and small sample size of such data. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used, which makes it possible to learn the gene weights during the clustering process and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets using two classifiers (support vector machine and k-nearest neighbor) and compared it with six other competitive feature selectors. The results show that the proposed algorithm is capable of achieving better accuracies and may be an efficient tool for finding possible biomarkers in gene expression data.

14.
Although different statistical approaches have been proposed for analyzing microarray time-course data, no method has been proposed for analyzing such data collected using the popular case-control design in clinical investigations, perhaps because of the increased complexity this poses for existing parametric and non-parametric approaches. In this paper, we introduce a multivariate data analysis technique, correspondence analysis, to analyze high-dimensional microarray time-course data in a case-control design. We show, through an example on type 2 diabetes, how the features of correspondence analysis can be used to explore the various time-course gene expression profiles that exist in the data. By coordinating and examining the projections of both the genes and the time-course experiments onto the reduced dimensions, we are able to identify important genes and time-course patterns and make inferences about their biological relevance. Using the sample replicates, we propose a bootstrap procedure for inferring the significance of the contributions of both the time-course experiments and the genes to the leading dimensions. Striking differences between the time-course patterns of the normal controls and the diabetes patients are revealed. In addition, the method identifies genes that display similar or comparable time-course expression patterns in both the cases and the controls. We conclude that our correspondence-analysis-based approach can be a useful tool for analyzing high-dimensional microarray data collected in clinical investigations.

15.
An important issue in the analysis of gene expression microarray data is the extraction of valuable genetic interactions from high-dimensional data sets containing gene expression levels collected for a small sample of assays. Past and ongoing research efforts have focused on biomarker selection for phenotype classification. Usually, many genes convey no useful information for classifying the outcome and should be removed from the analysis; on the other hand, some genes may be highly correlated, which reveals the presence of redundant expression information. In this paper, we propose a method for selecting highly predictive genes with low redundancy in their expression levels. The predictive accuracy of the selection is assessed by means of Classification and Regression Trees (CART) models, which enable assessment of the performance of the selected genes in classifying the outcome variable and also uncover complex genetic interactions. The method is illustrated throughout the paper using a public-domain colon cancer gene expression data set.

16.
17.
One important feature of gene expression data is that the number of genes M far exceeds the number of samples N. Standard statistical methods do not work well when N < M. New methodologies, or modifications of existing ones, are needed for the analysis of microarray data. In this paper, we propose a novel analysis procedure for classifying gene expression data. This procedure involves dimension reduction using kernel principal component analysis (KPCA) and classification with logistic regression (discrimination). KPCA is a nonlinear generalization of principal component analysis. The proposed algorithm was applied to five different gene expression datasets involving human tumor samples. Comparison with other popular classification methods, such as support vector machines and neural networks, shows that our algorithm is very promising for classifying gene expression data.

18.
Marker gene selection has been an important research topic in the classification analysis of gene expression data. Current methods try to reduce the “curse of dimensionality” by using statistical intra-feature set calculations, or classifiers that are based on the given dataset. In this paper, we present SoFoCles, an interactive tool that enables semantic feature filtering in microarray classification problems with the use of external, well-defined knowledge retrieved from the Gene Ontology. The notion of semantic similarity is used to derive genes that are involved in the same biological path during the microarray experiment, by enriching a feature set that has initially been produced with legacy methods. Among its other functionalities, SoFoCles offers a large repository of semantic similarity methods that are used to derive feature sets and marker genes. The structure and functionality of the tool are discussed in detail, as is its ability to improve classification accuracy. Through experimental evaluation, SoFoCles is shown to outperform other classification schemes in terms of classification accuracy on two real datasets using different semantic similarity computation approaches.

19.
MOTIVATIONS: One of the main problems in cancer diagnosis using DNA microarray data is selecting the genes relevant to the pathology by analyzing their expression profiles in tissues under two different phenotypical conditions. The question we pose is the following: how do we measure the relevance of a single gene to a given pathology? METHODS: A gene is relevant to a particular disease if we are able to correctly predict the occurrence of the pathology in new patients on the basis of its expression level alone. In other words, a gene is informative for the disease if its expression levels are useful for training a classifier that is able to generalize, that is, to correctly predict the status of new patients. In this paper, we present a selection-bias-free, statistically well-founded method for finding relevant genes on the basis of their classification ability. RESULTS: We applied the method to a colon cancer data set and produced a list of relevant genes, ranked on the basis of their prediction accuracy. We found, out of more than 6500 available genes, 54 overexpressed in normal tissues and 77 overexpressed in tumor tissues having prediction accuracy greater than 70% with p-value

20.
Gene expression datasets are a means to classify and predict the diagnostic categories of a patient. Selecting informative genes and selecting representative samples are two important aspects of reducing gene expression data. Identifying and pruning redundant genes and samples simultaneously can improve classification performance and circumvent the local optima problem. In the present paper, a modified particle swarm optimization is applied to select optimal genes and samples simultaneously, and a support vector machine is used as the objective function to determine the optimum set of genes and samples. To evaluate its performance, the proposed method was applied to three publicly available microarray datasets. The results demonstrate that the proposed method for gene and sample selection is a useful tool for mining high-dimensional data.
