Similar articles
Found 20 similar articles (search time: 46 ms)
1.
Gibbons FD  Roth FP 《Genome research》2002,12(10):1574-1581
We compare several commonly used expression-based gene clustering algorithms using a figure of merit based on the mutual information between cluster membership and known gene attributes. By studying various publicly available expression data sets we conclude that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers. As a measure of dissimilarity between the expression patterns of two genes, no method outperforms Euclidean distance for ratio-based measurements, or Pearson distance for non-ratio-based measurements at the optimal choice of cluster number. We show the self-organized-map approach to be best for both measurement types at higher numbers of clusters. Clusters of genes derived from single- and average-linkage hierarchical clustering tend to produce worse-than-random results.
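The mutual information that underlies such a figure of merit can be illustrated with a minimal sketch. It computes the mutual information, in bits, between a cluster assignment and a single gene attribute. The function name and the toy labels are assumptions made for illustration; the figure of merit used in the paper may combine many attributes and normalise against random clusterings, which is not reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(clusters, attributes):
    """Mutual information (in bits) between two equal-length label sequences.

    `clusters` holds one cluster label per gene and `attributes` one
    attribute value per gene (for example, True if the gene carries a
    given functional annotation). Both names are illustrative.
    """
    n = len(clusters)
    joint = Counter(zip(clusters, attributes))   # counts of (cluster, attribute) pairs
    cluster_counts = Counter(clusters)           # marginal counts per cluster
    attribute_counts = Counter(attributes)       # marginal counts per attribute value
    mi = 0.0
    for (c, a), count in joint.items():
        p_joint = count / n
        # p(c, a) / (p(c) * p(a)) simplifies to count * n / (n_c * n_a)
        ratio = count * n / (cluster_counts[c] * attribute_counts[a])
        mi += p_joint * log2(ratio)
    return mi
```

A partition that matches the attribute perfectly yields the full entropy of the attribute, while a partition that is independent of the attribute yields zero.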

2.
OBJECTIVE: Clustering algorithms may be applied to the analysis of DNA microarray data to identify novel subgroups that may lead to new taxonomies of diseases defined at bio-molecular level. A major problem related to the identification of biologically meaningful clusters is the assessment of their reliability, since clustering algorithms may find clusters even if no structure is present. METHODOLOGY: Recently, methods based on random "perturbations" of the data, such as bootstrapping, noise-injection techniques and random subspace methods, have been applied to the problem of cluster validity estimation. In this framework, we propose stability measures that exploit the high dimensionality of DNA microarray data and the redundancy of information stored in microarray chips. To this end we randomly project the original gene expression data into lower dimensional subspaces, approximately preserving the distance between the examples according to the Johnson-Lindenstrauss (JL) theory. The stability of the clusters discovered in the original high dimensional space is estimated by comparing them with the clusters discovered in randomly projected lower dimensional subspaces. The proposed cluster-stability measures may be applied to validate and to quantitatively assess the reliability of the clusters obtained by a large class of clustering algorithms. RESULTS AND CONCLUSION: We tested the effectiveness of our approach with high dimensional synthetic data, whose distribution is a priori known, showing that the stability measures based on randomized maps correctly predict the number of clusters and the reliability of each individual cluster. Then we showed how to apply the proposed measures to the analysis of DNA microarray data, whose underlying distribution is unknown. We evaluated the validity of clusters discovered by hierarchical clustering algorithms in diffuse large B-cell lymphoma (DLBCL) and malignant melanoma patients, showing that the proposed reliability measures can support bio-medical researchers in the identification of stable clusters of patients and in the discovery of new subtypes of diseases characterized at bio-molecular level.
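The general idea of estimating stability from random projections can be sketched as follows. This is only an illustration under stated assumptions, not the measures proposed in the abstract: the Gaussian projection, the small single-linkage routine, the Rand index used as agreement score, and all function names are choices made here for the example.

```python
import random
from itertools import combinations
from math import dist, sqrt


def random_projection(points, k, rng):
    """Project points to k dimensions with a Gaussian random matrix.

    Entries are scaled by 1/sqrt(k) so that squared distances are
    preserved in expectation, in the spirit of the JL lemma.
    """
    d = len(points[0])
    matrix = [[rng.gauss(0.0, 1.0) / sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]


def single_linkage(points, n_clusters):
    """Naive agglomerative single-linkage clustering; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the two clusters whose closest members are nearest to each other.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def stability(points, n_clusters, k, n_projections, seed=0):
    """Mean agreement between the original clustering and projected clusterings."""
    rng = random.Random(seed)
    reference = single_linkage(points, n_clusters)
    scores = [
        rand_index(reference, single_linkage(random_projection(points, k, rng), n_clusters))
        for _ in range(n_projections)
    ]
    return sum(scores) / len(scores)
```

A score near 1 suggests that the clustering survives the projections; a low score suggests that the structure found in the original space may not be reliable.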

3.
Yang Q  Sze SH 《Genome research》2008,18(6):949-956
An important strategy to study operons and their evolution is to investigate clustering of related genes across multiple bacterial genomes. Although existing algorithms are available that can identify gene clusters across two or more genomes, very few algorithms are efficient enough to study gene clusters across hundreds of genomes. We observe that a querying strategy can be used to analyze gene clusters across a large number of genomes and develop an efficient algorithm to identify all related clusters on a genome from a given query cluster. We use this algorithm to study gene clustering in 400 bacterial genomes by starting from a well-characterized list of operons in Escherichia coli K12 and perform comparative analysis of operon occurrences, gene orientations, and rearrangements both within and across clusters. We show that important biological insights can be obtained by comparing results across these categories. A software program implementing the algorithm (GCQuery) and supplementary data containing detailed results are available at http://faculty.cs.tamu.edu/shsze/gcquery.

4.
OBJECTIVE: SiMCAL 1 (simple multilevel clustering and linking, version 1) is a novel clustering algorithm for time-series microarray data, presented here with an application to a specific data set. The purpose of the algorithm is to present a complete feature set not found in either Jarvis-Patrick clustering, from which it is derived, or in other popular clustering methods such as hierarchical and k-means. The data concern the activity of the phosphatidylserine receptor (PSR) which is believed to be a crucial molecular switch in the mediation of inflammatory response in apoptosis and lysis. By analyzing the behavior of PSR-related genes in mouse macrophages, we hope to elucidate the mechanisms involved in this important biological process. METHODS AND MATERIALS: SiMCAL 1 is implemented in the Python programming language using the Numerical Python extensions, and the data are stored using the MySQL database management system. The data are derived from exposures of multiple Affymetrix mouse gene microarray chips to elevated levels of PSR antibody and control conditions. Code and data are available at (accessed: 17 January 2005). RESULTS: The algorithm meets its objectives: it is simple, in that it is computationally inexpensive; it is multilevel, in that it provides a small number of clearly defined hierarchical levels of clusters; and it offers linking between clusters at the same level in each hierarchy. Clustering and linking results indicate previously unknown co-regulation for genes expressing PGH synthase (COX2) and PGE2, appear to confirm increased production of proteins for clearance of apoptotic cells in the presence of PSR antibody, and correspond to other findings regarding the temporal relationship between PGE2 production and B cell proliferation and differentiation. These results are promising but should be taken as highly preliminary. CONCLUSION: Both the algorithm and its application to this problem show great potential for future development. 
We plan to improve and extend the SiMCAL family of algorithms, and to obtain new data so that the algorithm(s) may be further applied to this and other problems of interest.

5.
6.
A drastic improvement in the analysis of gene expression has led to new discoveries in bioinformatics research. In order to analyse the gene expression data, fuzzy clustering algorithms are widely used. However, the resulting analyses from these specific types of algorithms may lead to confusion in hypotheses with regard to the suggestion of dominant function for genes of interest. Besides that, the current fuzzy clustering algorithms do not conduct a thorough analysis of genes with low membership values. Therefore, we present a novel computational framework called the “multi-stage filtering-Clustering Functional Annotation” (msf-CluFA) for clustering gene expression data. The framework consists of four components: fuzzy c-means clustering (msf-CluFA-0), achieving dominant cluster (msf-CluFA-1), improving confidence level (msf-CluFA-2) and combination of msf-CluFA-0, msf-CluFA-1 and msf-CluFA-2 (msf-CluFA-3). By employing double filtering in msf-CluFA-1 and apriori algorithms in msf-CluFA-2, our new framework is capable of determining the dominant clusters and improving the confidence level of genes with lower membership values by means of which the unknown genes can be predicted.

7.
This paper describes a new technique for clustering short time series of gene expression data. The technique is a generalization of the template-based clustering and is based on a qualitative representation of profiles which are labelled using trend Temporal Abstractions (TAs); clusters are then dynamically identified on the basis of this qualitative representation. Clustering is performed in an efficient way at three different levels of aggregation of qualitative labels, each level corresponding to a distinct degree of qualitative representation. The developed TA-clustering algorithm provides an innovative way to cluster gene profiles. We show the developed method to be robust, efficient and to perform better than the standard hierarchical agglomerative clustering approach when dealing with temporal dislocations of time series. Results of the TA-clustering algorithm can be visualized as a three-level hierarchical tree of qualitative representations and as such are easy to interpret. We demonstrate the utility of the proposed algorithm on a set of two simulated data sets and on a study of gene expression data from S. cerevisiae.
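Grouping profiles by a qualitative trend representation can be sketched in a few lines. The sketch below labels each step of a short time series as increasing, decreasing or steady and groups genes with identical label strings. The tolerance threshold, the single-letter labels and the function names are assumptions made for illustration; the three aggregation levels described in the abstract are not reproduced.

```python
from collections import defaultdict


def trend_labels(profile, tolerance=0.1):
    """Label each consecutive step of a profile as I, D or S.

    I = increasing, D = decreasing, S = steady (change within the
    hypothetical `tolerance`).
    """
    labels = []
    for previous, current in zip(profile, profile[1:]):
        change = current - previous
        if change > tolerance:
            labels.append("I")
        elif change < -tolerance:
            labels.append("D")
        else:
            labels.append("S")
    return "".join(labels)


def cluster_by_trend(profiles, tolerance=0.1):
    """Group gene names whose profiles share the same trend label string."""
    clusters = defaultdict(list)
    for name, profile in profiles.items():
        clusters[trend_labels(profile, tolerance)].append(name)
    return dict(clusters)
```

Because the grouping depends only on the direction of change, two genes with different magnitudes but the same qualitative shape fall into the same cluster.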

8.
Clustering algorithms have been shown to be useful to explore large-scale gene expression profiles. Visualization and objective evaluation of clusters are two important considerations when users are selecting different clustering algorithms, but they are often overlooked. The development of a framework and software tools that implement comprehensive data visualization and objective measures of cluster quality is crucial. In this paper, we describe a theoretical framework and formalizations for consistently developing clustering algorithms. A new clustering algorithm was developed within the proposed framework. We demonstrate that a theoretically sound principle can be uniformly applied to the development of a cluster-optimization function, a comprehensive data-visualization strategy, and objective cluster-evaluation measures, as well as to the actual implementation of the principle. Cluster consistency and quality measures of the algorithm are rigorously evaluated against those of popular clustering algorithms for gene expression data analysis (K-means and self-organizing maps), in four data sets, yielding promising results.

9.
Fuzzy c-means clustering with prior biological knowledge (cited by 1; self-citations: 0; citations by others: 1)
We propose a novel semi-supervised clustering method called GO Fuzzy c-means, which enables the simultaneous use of biological knowledge and gene expression data in a probabilistic clustering algorithm. Our method is based on the fuzzy c-means clustering algorithm and utilizes the Gene Ontology annotations as prior knowledge to guide the process of grouping functionally related genes. Unlike traditional clustering methods, our method is capable of assigning genes to multiple clusters, which is a more appropriate representation of the behavior of genes. Two datasets of yeast (Saccharomyces cerevisiae) expression profiles were used to compare our method with other state-of-the-art clustering methods. Our experiments show that our method can produce far better biologically meaningful clusters even with the use of a small percentage of Gene Ontology annotations. In addition, our experiments further indicate that the utilization of prior knowledge in our method can predict gene functions effectively. The source code is freely available at http://sysbio.fulton.asu.edu/gofuzzy/.
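The way fuzzy c-means assigns a gene to several clusters at once can be illustrated with the standard membership formula, u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance to center i and m is the fuzzifier. The sketch below covers only this plain fuzzy c-means step; how GO Fuzzy c-means incorporates Gene Ontology annotations is not described in the abstract and is not modelled. The function name and the default m = 2 are assumptions.

```python
from math import dist


def fuzzy_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster.

    `m` is the fuzzifier (m > 1); larger values give softer memberships.
    The returned memberships sum to 1.
    """
    distances = [dist(point, center) for center in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    if 0.0 in distances:
        hits = [d == 0.0 for d in distances]
        return [hit / sum(hits) for hit in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]
```

For a point at distance 1 from one center and 3 from another, with m = 2, the memberships are 0.9 and 0.1, so the point is mostly, but not exclusively, assigned to the nearer cluster.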

10.
Assessing clusters and motifs from gene expression data (cited by 4; self-citations: 0; citations by others: 4)
Jakt LM  Cao L  Cheah KS  Smith DK 《Genome research》2001,11(1):112-123

11.
Analysis procedures are needed to extract useful information from the large amount of gene expression data that is becoming available. This work describes a set of analytical tools and their application to yeast cell cycle data. The components of our approach are (1) a similarity measure that reduces the number of false positives, (2) a new clustering algorithm designed specifically for grouping gene expression patterns, and (3) an interactive graphical cluster analysis tool that allows user feedback and validation. We use the clusters generated by our algorithm to summarize genome-wide expression and to initiate supervised clustering of genes into biologically meaningful groups.

12.
Holland PW 《Journal of anatomy》2001,199(PT 1-2):13-23
The arrangement of Hox genes into physical clusters is fundamental to the patterning of animal body plans, through the phenomenon of colinearity. Other homeobox genes are often described as dispersed, implying they are not arranged into clusters. Contrary to this view, however, two clusters of non-Hox homeobox genes have been reported: the amphioxus ParaHox gene cluster and the Drosophila 93D/E cluster (referred to here as the NKL cluster). Here I examine the antiquity of these gene clusters, their conservation and their pattern of evolution in vertebrate genomes. I argue that the ParaHox gene cluster arose early in animal evolution, and duplicated in vertebrates to give the four clusters in human and mouse genomes. The NKL cluster is also ancient, and also duplicated to yield four descendent clusters in mammalian genomes. The NKL and Hox gene clusters were originally chromosomal neighbours, within an ancient and extensive array of at least 30 related homeobox genes. There is no necessary relationship between clustering and colinearity, although it is argued that the ParaHox gene cluster does show modified spatial colinearity. A novel hypothesis for the evolution of ParaHox gene expression in deuterostomes is presented.

13.
Beyond the Hox: how widespread is homeobox gene clustering? (cited by 4; self-citations: 0; citations by others: 4)

14.
In microarray gene expression data, clusters may hide in certain subspaces. For example, a set of co-regulated genes may have similar expression patterns in only a subset of the samples in which certain regulating factors are present. Their expression patterns could be dissimilar when measured in the full input space. Traditional clustering algorithms that make use of such similarity measurements may fail to identify the clusters. In recent years a number of algorithms have been proposed to identify projected clusters of this kind, but many of them rely on some critical parameters whose proper values are hard for users to determine. In this paper, a new algorithm that dynamically adjusts its internal thresholds is proposed. It has a low dependency on user parameters while allowing users to input some domain knowledge should it be available. Experimental results show that the algorithm is capable of identifying some interesting projected clusters.

15.
16.
Phenotypic and molecular parallels between the development of chondrosarcoma and the differentiation of chondrocytes in normal growth plate suggest that chondrosarcoma may arise from mesenchymal precursor cells driven towards chondrogenesis. We hypothesized that a comparison between cartilaginous tumours and their possible physiological cells of origin, mesenchymal stem cells (MSCs), might have biological and clinical relevance. MSCs from eight donors were submitted to chondrogenic differentiation in spheroid cultures. Expression profiles of MSCs at days 0, 7, 14, 28 and 42 of chondrogenesis and of 18 chondrosarcomas with different histological grades were studied using a customized cDNA array. Hierarchical clustering of MSC gene expression during chondrogenesis allowed the classification of samples in a pre-chondrogenic and a chondrogenic cluster corresponding to the phenotypes of early and late differentiation stages. The 74 genes differentially expressed between the two clusters were defined as chondrogenesis-relevant genes. Gene expression profiles of chondrosarcoma were submitted to hierarchical clustering on the basis of these chondrogenesis-relevant genes. This analysis allowed clear distinction between grade I and grade III chondrosarcoma and separated grade II chondrosarcoma into two groups. All grade II chondrosarcomas with occurrence of metastasis were found together with the grade III chondrosarcomas in the pre-chondrogenic cluster. This analysis shows that a molecular approach based on the comparison of tumour samples to an in vitro model for chondrogenic differentiation allows a new classification of chondrosarcoma in two clusters. These data suggest that the identification of a pre-chondrogenic and a chondrogenic phenotype for chondrosarcoma by gene expression profiling could develop into a useful tool to predict the clinical behaviour of chondrosarcoma. Copyright (c) 2008 Pathological Society of Great Britain and Ireland. 
Published by John Wiley & Sons, Ltd.

17.
Three important areas of data analysis for global gene expression analysis are class discovery, class prediction, and finding dysregulated genes (biomarkers). The clinical application of microarray data will require marker genes whose expression patterns are sufficiently well understood to allow accurate predictions on disease subclass membership. Commonly used methods of analysis include hierarchical clustering algorithms, t-, F-, and Z-tests, and machine learning approaches. We describe an approach called the maximum difference subset (MDSS) algorithm that combines classification algorithms, classical statistics, and elements of machine learning and provides a coherent framework. By integrating prediction accuracy, the MDSS algorithm learns the critical threshold of statistical significance (the alpha or P-value), eliminating the arbitrariness of setting a threshold of statistical significance and minimizing the effect of the normality assumptions. To reduce the false positive rate and to increase external validity of the predictive gene set, a jackknife step is used. This step identifies and removes genes in the initial MDSS with low combined predictive utility. The overall MDSS provides a prediction that is less dependent on an arbitrary study design (sample inclusion or exclusion) and should thus have high external validity. We demonstrate that this approach, unlike other published methods, identifies biomarkers capable of predicting the outcome of anthracycline-cytarabine chemotherapy in cases of acute myeloid leukemia. By incorporating two criteria (statistical significance and predictive utility), the approach learns the significance level relevant for a given data set. The MDSS approach can be used with any test and classifier operator pair.

18.
BACKGROUND: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the used cluster algorithm works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). METHODS: Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM the distance structure in the high dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms including single linkage, Ward and k-means. RESULTS: Ward clustering imposed cluster structures on cluster-less “golf ball”, “cuboid” and “S-shaped” data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high dimensional biomedical data. CONCLUSIONS: The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.

19.
Clustering is widely used in bioinformatics to find gene correlation patterns. Although many algorithms have been proposed, these are usually confronted with difficulties in meeting the requirements of both automation and high quality. In this paper, we propose a novel algorithm for clustering genes from their expression profiles. The unique features of the proposed algorithm are twofold: it takes into consideration global, rather than local, gene correlation information in clustering processes; and it incorporates clustering quality measurement into the clustering processes to implement non-parametric, automatic and global optimal gene clustering. The evaluation on simulated and real gene data sets demonstrates the effectiveness of the algorithm.

20.
Many problems in the field of biomedical signal processing can be reduced to a task of state recognition and event prediction. Examples can be found in tachycardia detection from ECG signals, epileptic seizure or psychotic attack prediction from an EEG signal, and prediction of vehicle drivers falling asleep from both signals. The problem generally treats a set of ordered measurements and asks for the recognition of some patterns of observed elements that will forecast an event or a transition between two different states of the biological system. It is proposed to apply clustering methods to grouping discontinuous related temporal patterns of a continuously sampled measurement. The vague switches from one stationary state to another are naturally treated by means of fuzzy clustering. In such cases, an adaptive selection of the number of clusters (the number of underlying semi-stationary processes) can overcome the general non-stationary nature of biomedical signals and enable the formation of a warning cluster. The algorithm suggested for the clustering is a new recursive algorithm for hierarchical fuzzy partitioning. Each pattern can have a non-zero membership in more than one data subset in the hierarchy. A ‘natural’ and feasible solution to the cluster validity problem is suggested by combining hierarchical and fuzzy concepts. The algorithm is shown to be effective for a variety of data sets with a wide dynamic range of both covariance matrices and number of members in each class. The new method is applied to state recognition during recovery from exercise using the heart rate signal and to the forecasting of generalised epileptic seizures from the EEG signal.
