Similar literature
 20 similar articles found; search took 381 ms
1.
In microarray data analysis, each gene expression sample contains thousands of genes, so reducing this high dimensionality is useful both for visualization and for subsequent clustering of samples. Principal component analysis (PCA) is the commonly used traditional method, but it has limitations. Nonnegative matrix factorization (NMF) is a newer dimension reduction method. In this paper we compare NMF and PCA for dimension reduction. The reduced data are used for visualization and for clustering analysis via k-means on 11 real gene expression datasets. Before the clustering analysis, we apply NMF and PCA to reduce the data for visualization. The results on one leukemia dataset show that NMF can discover natural clusters and clearly detect one mislabeled sample, while PCA cannot. For clustering analysis via k-means, NMF typically outperforms PCA. Our results demonstrate the superiority of NMF over PCA in reducing microarray data.
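
The abstract above does not say which NMF algorithm was used. As a rough illustration only, the sketch below factors a small nonnegative samples-by-genes matrix V into W (samples by components) and H (components by genes) with the widely known Lee-Seung multiplicative updates for the squared-error objective. The function names, the toy matrix, the iteration count, and the `eps` guard are all assumptions made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor nonnegative V (n x m) as W (n x rank) times H (rank x m).

    Uses Lee-Seung multiplicative updates; eps only guards against
    division by zero and is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the updates cannot stall at zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Hypothetical toy data: 4 samples (rows) by 3 genes (columns).
    toy = [[5, 4, 0], [4, 5, 1], [0, 1, 5], [1, 0, 4]]
    w, h = nmf(toy, rank=2)
    print("reduced samples:", w)
    print("squared error:", frobenius_error(toy, w, h))
```

Each row of W is the reduced representation of one sample; in a pipeline like the one described, those rows would be what a k-means step clusters.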

2.
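Objective: Gene chip (microarray) technology has had a revolutionary impact on clinical diagnosis, treatment, and drug development and screening. Because high-dimensional medical data are difficult to reduce, and gene expression profile data have few samples, high dimensionality, and heavy noise, dimensionality reduction is essential. Methods based on principal component analysis (PCA) and linear discriminant analysis (LDA) are used to address the classification of gene expression profile data and to improve the recognition rate. Methods: PCA and LDA are each introduced to reduce the dimensionality of the gene expression profile data, a K-nearest neighbor (KNN) classifier is then used to classify the data, and the approach is applied to breast cancer and ovarian cancer mass spectrometry data. Results: Applying PCA and LDA to the two cancer mass spectrometry datasets effectively extracts discriminative feature information and greatly reduces the dimensionality of the medical data while maintaining high classification accuracy. Conclusion: Classifying cancer gene expression profile data with dimensionality reduction can help clinicians discover new disease features and improve diagnostic accuracy.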

3.
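Cluster analysis of microarray gene expression profiles helps to identify co-expressed genes, and co-expression is often a property of co-regulated genes. Accurate clustering of gene expression profiles therefore helps to uncover regulatory relationships among genes more accurately. This study applies manifold learning methods from machine learning, namely isometric mapping (Isomap), locally linear embedding (LLE), and Laplacian eigenmaps, to gene expression profile data to obtain nonlinearly dimension-reduced data. Clustering methods such as K-means clustering, fuzzy clustering, and self-organizing map neural networks are then applied. With a given threshold, 117 co-expressed gene pairs were obtained from 382 clustering results on yeast gene expression data, and 89 co-expressed gene pairs were obtained from 132 clustering results on gene expression data from human serum tissue cells. The evaluation criteria used show that the manifold-learning-based clustering methods perform comparably to earlier methods and can be used to discover low-dimensional manifold structure in high-dimensional microarray expression data.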

4.
A drastic improvement in the analysis of gene expression has led to new discoveries in bioinformatics research. Fuzzy clustering algorithms are widely used to analyse gene expression data. However, the analyses produced by these algorithms may lead to confusion in hypotheses about the dominant function of genes of interest. In addition, current fuzzy clustering algorithms do not thoroughly analyse genes with low membership values. We therefore present a novel computational framework called the “multi-stage filtering-Clustering Functional Annotation” (msf-CluFA) for clustering gene expression data. The framework consists of four components: fuzzy c-means clustering (msf-CluFA-0), achieving dominant cluster (msf-CluFA-1), improving confidence level (msf-CluFA-2) and the combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2 (msf-CluFA-3). By employing double filtering in msf-CluFA-1 and apriori algorithms in msf-CluFA-2, our new framework is capable of determining the dominant clusters and improving the confidence level of genes with lower membership values, by means of which the unknown genes can be predicted.
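
To make the notion of a "membership value" concrete, the sketch below computes the standard fuzzy c-means membership of one expression profile in each cluster, given fixed cluster centers, and flags profiles whose strongest membership is weak. It is not the msf-CluFA implementation: the function names, the fuzzifier `m = 2`, the 0.6 threshold, and the toy points are assumptions chosen for illustration.

```python
def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), with Euclidean distances d.
    """
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
             for center in centers]
    # A point sitting exactly on a center belongs fully to that cluster.
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [x / sum(hits) for x in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in dists) for di in dists]


def low_membership(points, centers, threshold=0.6, m=2.0):
    """Indices of points whose largest membership is below the threshold."""
    return [i for i, p in enumerate(points)
            if max(fcm_memberships(p, centers, m)) < threshold]


if __name__ == "__main__":
    # Hypothetical two-cluster example in two dimensions.
    centers = [(0.0, 0.0), (4.0, 0.0)]
    print(fcm_memberships((1.0, 0.0), centers))  # roughly 0.9 and 0.1
    print(low_membership([(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)], centers))
```

A point midway between two centers gets a membership of 0.5 in each, which is the kind of ambiguous gene that a later filtering stage would need to examine.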

5.
Clustering algorithms have been shown to be useful for exploring large-scale gene expression profiles. Visualization and objective evaluation of clusters are two important considerations when users select among clustering algorithms, but they are often overlooked. The development of a framework and software tools that implement comprehensive data visualization and objective measures of cluster quality is crucial. In this paper, we describe a theoretical framework and formalizations for consistently developing clustering algorithms. A new clustering algorithm was developed within the proposed framework. We demonstrate that a theoretically sound principle can be uniformly applied to the development of a cluster-optimization function, a comprehensive data-visualization strategy, and objective cluster-evaluation measures, as well as to the actual implementation of the principle. Cluster consistency and quality measures of the algorithm are rigorously evaluated against those of popular clustering algorithms for gene expression data analysis (K-means and self-organizing maps) on four data sets, yielding promising results.

6.
Clustering is widely used in bioinformatics to find gene correlation patterns. Although many algorithms have been proposed, they usually have difficulty meeting the requirements of both automation and high quality. In this paper, we propose a novel algorithm for clustering genes from their expression profiles. The unique features of the proposed algorithm are twofold: it takes into consideration global, rather than local, gene correlation information in clustering processes; and it incorporates clustering quality measurement into the clustering processes to implement non-parametric, automatic and globally optimal gene clustering. The evaluation on simulated and real gene data sets demonstrates the effectiveness of the algorithm.

7.
8.
9.
The very high dimensional space of gene expression measurements obtained by DNA microarrays impedes the detection of underlying patterns in gene expression data and the identification of discriminatory genes. In this paper we show the use of projection methods such as principal components analysis (PCA) to obtain a direct link between patterns in the genes and patterns in samples. This feature is useful in the initial interactive pattern exploration of gene expression data and data-driven learning of the nature and types of samples. Using oligonucleotide microarray measurements of 40 samples from different normal human tissues, we show that distinct patterns are obtained when the genes are projected on a two-dimensional plane spanned by the loadings of the two major principal components. These patterns define the particular genes associated with a sample class (i.e., tissue). When used separately from the other genes, these class-specific (i.e., tissue-specific) genes in turn define distinct tissue patterns in the projection space spanned by the scores of the two major principal components. In this study, PCA projection facilitated discriminatory gene selection for different tissues and identified tissue-specific gene expression signatures for liver, skeletal muscle, and brain samples. Furthermore, it allowed the classification of nine new samples belonging to these three types using the linear combination of the expression levels of the tissue-specific genes determined from the first set of samples. The application of the technique to other published data sets is also discussed.

10.
BACKGROUND AND MOTIVATION: DNA microarray technology has made it possible to determine the expression levels of thousands of genes in parallel under multiple experimental conditions. Genome-wide analyses using DNA microarrays make a great contribution to the exploration of the dynamic state of genetic networks, and further lead to the development of new disease diagnosis technologies. An important step in the analysis of gene expression data is to classify genes with similar expression patterns into the same groups. To this end, hierarchical clustering algorithms have been widely used. Major advantages of hierarchical clustering algorithms are that investigators do not need to specify the number of clusters in advance and results are presented visually in the form of a dendrogram. However, since traditional hierarchical clustering methods simply provide results on the statistical characteristics of expression data, biological interpretations of the resulting clusters are not easy, and it requires laborious tasks to unveil hidden biological processes regulated by members in the clusters. Therefore, it has been a very difficult routine for experts. OBJECTIVE: Here, we propose a novel algorithm in which cluster boundaries are determined by referring to functional annotations stored in genome databases. MATERIALS AND METHODS: The algorithm first performs hierarchical clustering of gene expression profiles. Then, the cluster boundaries are determined by the Variance Inflation Factor among the Gene Function Vectors, which represents distributions of gene functions in each cluster. Our algorithm automatically specifies a cutoff that leads to functionally independent agglomerations of genes on the dendrogram derived from similarities among gene expression patterns. Finally, each cluster is annotated according to dominant gene functions within the respective cluster. 
RESULTS AND CONCLUSIONS: In this paper, we apply our algorithm to two gene expression datasets related to cell cycle and cold stress response in the budding yeast Saccharomyces cerevisiae. We show that the algorithm enables us to recognize cluster boundaries characterizing fundamental biological processes such as the Early G1, Late G1, S, G2 and M phases of the cell cycle, and also provides novel annotation information that has not been obtained by traditional hierarchical clustering methods. In addition, using formal cluster validity indices, the high validity of our algorithm is verified by comparison with other popular clustering algorithms: K-means, self-organizing map and AutoClass.

11.
Microarray-based gene expression profiling has emerged as an efficient technique for classification, prognosis, diagnosis, and treatment of cancer. Frequent changes in the behavior of this disease generate an enormous volume of data. Microarray data satisfies both the veracity and velocity properties of big data, as it keeps changing with time. Therefore, the analysis of microarray datasets in a small amount of time is essential. They often contain a large amount of expression data, but only a fraction of it comprises genes that are significantly expressed. The precise identification of the genes of interest that are responsible for causing cancer is imperative in microarray data analysis. Most existing schemes employ a two-phase process, such as feature selection/extraction followed by classification. In this paper, various statistical methods (tests) based on MapReduce are proposed for selecting relevant features. After feature selection, a MapReduce-based K-nearest neighbor (mrKNN) classifier is also employed to classify microarray data. These algorithms are successfully implemented in a Hadoop framework. A comparative analysis is done on these MapReduce-based models using microarray datasets of various dimensions. From the obtained results, it is observed that these models consume much less execution time than conventional models in processing big data.

12.
cDNA microarray and cluster analysis of gastric cancer gene expression profiles   Cited by: 7 (self-citations: 0; cited by others: 7)
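Objective: To analyze the gene expression characteristics of gastric cancer and non-tumor gastric tissue and to explore their biological significance. Methods: Total RNA was extracted from fresh tumor and non-tumor gastric tissue of 18 patients with advanced gastric cancer who had received no preoperative treatment. cDNA probes were prepared by reverse transcription with Cy5 and Cy3 labeling and hybridized to a cDNA microarray of 148 genes. Experimental data for the 146 genes that met the inclusion criteria were analyzed by average-linkage hierarchical clustering and significance analysis of microarrays (SAM). Results: Gastric cancer and non-tumor gastric tissue each clustered into one class, and each class further split into two subclasses. Gene expression in the two tissue types showed three features, with marked expression differences in features B and C: feature B genes showed low or no expression in gastric cancer tissue, whereas feature C genes showed high expression in gastric cancer tissue. In feature A, gene expression in the T2-S2 subclass differed from that in the T1 and T2-S1 subclasses, yet the paired gastric cancer and non-tumor gastric tissues of 13 patients had similar gene expression. Combined with SAM analysis, 19 genes from feature B and 12 genes from feature C were identified as differentially expressed between the two tissue types. Conclusion: The cDNA microarray results objectively reflect the gene expression characteristics of gastric cancer and non-tumor gastric tissue and can cluster each into its own class. Gene expression among gastric cancer tissues shows both similarity and heterogeneity, reflecting the complexity of gene expression variation in gastric cancer. Using cDNA microarray technology to study differential gene expression in gastric cancer helps to elucidate the molecular basis of gastric cancer onset and progression and provides a scientific basis for research on biomarkers for early diagnosis and prognosis assessment.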

13.
Gibbons FD, Roth FP. Genome Research, 2002, 12(10): 1574-1581
We compare several commonly used expression-based gene clustering algorithms using a figure of merit based on the mutual information between cluster membership and known gene attributes. By studying various publicly available expression data sets we conclude that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers. As a measure of dissimilarity between the expression patterns of two genes, no method outperforms Euclidean distance for ratio-based measurements, or Pearson distance for non-ratio-based measurements at the optimal choice of cluster number. We show the self-organized-map approach to be best for both measurement types at higher numbers of clusters. Clusters of genes derived from single- and average-linkage hierarchical clustering tend to produce worse-than-random results.
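
The quantity underlying this figure of merit can be sketched as the plain mutual information, in bits, between cluster labels and one categorical gene attribute. The abstract does not give the exact normalization or significance scoring used in the paper, so the function below shows only the raw quantity; its name and the toy labels are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(clusters, attributes):
    """Mutual information (bits) between two equal-length label sequences."""
    n = len(clusters)
    joint = Counter(zip(clusters, attributes))
    cluster_counts = Counter(clusters)
    attribute_counts = Counter(attributes)
    # Sum p(c, a) * log2( p(c, a) / (p(c) * p(a)) ) over observed pairs.
    return sum(
        (n_ca / n) * log2(n_ca * n / (cluster_counts[c] * attribute_counts[a]))
        for (c, a), n_ca in joint.items()
    )


if __name__ == "__main__":
    # Clusters that match the attribute exactly carry 1 bit for two equal groups.
    print(mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"]))  # 1.0
    # Clusters unrelated to the attribute carry no information.
    print(mutual_information([0, 0, 1, 1], ["x", "y", "x", "y"]))  # 0.0
```

Higher values mean that knowing a gene's cluster says more about its annotation, which is the sense in which a clustering is "enriched" for biological function.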

14.
Classification of gene expression data plays a significant role in the prediction and diagnosis of diseases. Gene expression data has the special characteristic that the gene dimension is far larger than the sample dimension. Not all genes contribute to efficient classification of samples. A robust feature selection algorithm is required to identify the important genes which help in classifying the samples efficiently. In order to select informative genes (features) based on relevance and redundancy characteristics, many feature selection algorithms have been introduced in the past. Most of the earlier algorithms require a computationally expensive search strategy to find an optimal feature subset. Existing feature selection methods are also sensitive to the evaluation measures. The paper introduces a novel and efficient feature selection approach, termed ERGS (Effective Range based Gene Selection), based on the statistically defined effective range of features for every class. The basic principle behind ERGS is that higher weight is given to the feature that discriminates the classes clearly. Experimental results on well-known gene expression datasets illustrate the effectiveness of the proposed approach. Two popular classifiers, viz. the Naive Bayes Classifier (NBC) and Support Vector Machine (SVM), have been used for classification. The proposed feature selection algorithm can be helpful in ranking the genes and is also capable of identifying the most relevant genes responsible for diseases such as leukemia, colon tumor, lung cancer, diffuse large B-cell lymphoma (DLBCL), and prostate cancer.
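
The abstract states only the principle: a gene whose per-class value ranges barely overlap should receive a higher weight. The sketch below is a simplified stand-in for that idea, not the published ERGS formulas. The interval definition (class mean plus or minus a scaled standard deviation, shrunk by the class prior), the width factor 1.732, the overlap normalization, and all names and toy data are assumptions.

```python
from statistics import mean, pstdev

GAMMA = 1.732  # assumed width factor for the per-class interval


def effective_ranges(values, labels, gamma=GAMMA):
    """Per-class interval [mu - (1 - p) * gamma * sigma, mu + (1 - p) * gamma * sigma]."""
    n = len(values)
    ranges = {}
    for cls in sorted(set(labels)):
        xs = [v for v, lab in zip(values, labels) if lab == cls]
        half_width = (1 - len(xs) / n) * gamma * pstdev(xs)
        ranges[cls] = (mean(xs) - half_width, mean(xs) + half_width)
    return ranges


def overlap_score(values, labels):
    """Total pairwise overlap of class intervals divided by their overall span.

    0.0 means the classes are fully separated on this feature.
    """
    ranges = sorted(effective_ranges(values, labels).values())
    overlap = 0.0
    for i, (lo_i, hi_i) in enumerate(ranges):
        for lo_j, hi_j in ranges[i + 1:]:
            overlap += max(0.0, min(hi_i, hi_j) - max(lo_i, lo_j))
    span = max(hi for _, hi in ranges) - min(lo for lo, _ in ranges)
    # A constant feature has zero span and cannot discriminate anything.
    return overlap / span if span > 0 else 1.0


def rank_features(columns, labels):
    """Feature names ordered from most to least discriminative."""
    weights = {name: 1.0 - overlap_score(vals, labels)
               for name, vals in columns.items()}
    return sorted(weights, key=weights.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical data: six samples, two classes, two genes.
    labels = [0, 0, 0, 1, 1, 1]
    genes = {
        "gene_b": [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],  # classes interleave
        "gene_a": [1.0, 1.2, 0.8, 5.0, 5.2, 4.8],  # classes well separated
    }
    print(rank_features(genes, labels))
```

Because each feature is scored independently from class means and standard deviations, a ranking of this kind needs no search over feature subsets, which fits the abstract's emphasis on efficiency.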

15.
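Objective: To find a method for extracting differentially expressed genes for tumor-related gene-based diagnosis and treatment. Methods: Gene expression profile data were preprocessed, a feature subset of differentially expressed genes was selected using the relative risk method, and the distances between samples were computed. The feature genes were then weighted and ranked, and redundant genes were filtered out. Finally, a classifier was applied to an ovarian cancer gene dataset to test the effectiveness of the method. Results: Twenty feature genes were selected for classification tests. With 3 to 5, 7, and 12 to 20 feature genes, classification accuracy reached 100% and the false positive rate reached 0, showing good reliability and effectively separating the two sample types. Conclusion: Classifier tests show that the method achieves high classification accuracy and outperforms traditional differential gene expression analysis methods.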

16.
Gene selection is important for cancer classification based on gene expression data because of the high dimensionality and small sample size. In this paper, we present a new gene selection method based on clustering, in which dissimilarity measures are obtained through kernel functions. It iteratively searches for the best gene weights while optimizing the clustering objective function. An adaptive distance is used in the process, which is suitable for learning the weights of genes during clustering and improves the performance of the algorithm. The proposed algorithm is simple and does not require any modification or parameter optimization for each dataset. We tested it on eight publicly available datasets, using two classifiers (support vector machine and k-nearest neighbor), and compared it with six other competitive feature selectors. The results show that the proposed algorithm is capable of achieving better accuracies and may be an efficient tool for finding possible biomarkers from gene expression data.

17.
We report the application of a support vector machine (SVM) for the development of diagnostic algorithms for optical diagnosis of cancer. Both linear and nonlinear SVMs have been investigated for this purpose. We develop a methodology that makes use of SVM for both feature extraction and classification jointly by integrating the newly developed recursive feature elimination (RFE) in the framework of SVM. This leads to significantly improved classification results compared to those obtained when an independent feature extractor such as principal component analysis (PCA) is used. The integrated SVM-RFE approach is also found to outperform the classification results yielded by traditional Fisher's linear discriminant (FLD)-based algorithms. All the algorithms are developed using spectral data acquired in a clinical in vivo laser-induced fluorescence (LIF) spectroscopic study conducted on patients being screened for cancer of the oral cavity and normal volunteers. The best sensitivity and specificity values provided by the nonlinear SVM-RFE algorithm over the data sets investigated are 95 and 96% toward cancer for the training set data based on leave-one-out cross validation and 93 and 97% toward cancer for the independent validation set data. When tested on the spectral data of the uninvolved oral cavity sites from the patients it yielded a specificity of 85%.

18.
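Objective: To explore, at the level of gene expression profiles, possible interaction effects between two human fibroblast cell lines and three time points after low-dose ionizing radiation, and to mine the differentially expressed genes. Methods: Using the Gene Sifter online software and the Panther biological information database, 12 GSM samples downloaded from NCBI's GEO (Gene Expression Omnibus) database (covering two fibroblast lines with three irradiation time points each) were mined with a mixed-design two-factor Two-Way ANOVA. The differentially expressed genes were then subjected to clustering, principal component analysis, pattern classification, and functional categorization, and literature mining was finally performed with the GenCLiP software. Results: Forty-one differentially expressed genes with interaction effects were obtained. Principal component analysis showed that the feature vector of the 24 h time point of each cell line lay clearly apart from the other vectors, and the differential gene expression fell into three patterns. Functional categorization suggested that several biological pathways, including mitosis, cell cycle, cell structure, and apoptosis, were significantly activated. The differentially expressed genes involved in mitosis and the cell cycle comprised cytoskeleton genes, microtubule motor genes, protein kinase genes, and others, and effects such as apoptosis were supported by the literature. Conclusion: The two human fibroblast lines and the three time points after low-dose ionizing radiation show an interaction effect with differential gene expression, and these genes may serve as biomarkers for identifying the interaction between fibroblast line and irradiation time.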

19.
20.
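Objective: Transcriptome sequencing provides a powerful means of studying the physiological state of specific tissues and cells and changes at the molecular level. To relate molecular-level changes to the physiological functions of tissues and cells while excluding random interference, an expression pattern analysis method based on series RNA-seq data is needed. Methods: This paper proposes an integrated method, the gene expression cluster method (GECluster), for pattern clustering of series samples. It combines curve fitting and information entropy to build a model and extract feature attributes, and then performs hierarchical clustering of genes on those feature attributes. Results: Genes with consistent expression trends were clustered well into the same class. Functional enrichment analysis found that these genes are strongly functionally related, in agreement with published reports. Conclusion: GECluster can mine co-expressed genes from multi-sample series RNA-seq data more flexibly and objectively, providing a more effective approach for subsequent functional analysis.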
