Similar articles
 20 similar articles found (search time: 578 ms)
1.
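Evaluation methods for clustering results of gene expression data   Total citations: 3, self-citations: 0, citations by others: 3
Objective: To examine methods for evaluating clustering results of gene expression data and to provide a criterion for identifying the best clustering result. Methods: Clustering results were judged from two perspectives: data structure (internal information) and functional classification (external information). On the one hand, an Entropy (information entropy) criterion was used to assess how well a clustering result agrees with the classification of genes whose function is already partly known; on the other hand, the adjust-FOM method was used to evaluate the result from the structure of the data itself. The two were combined into a new evaluation method, termed the Entropy-FOM method. Results: The method was applied to evaluate clustering results on Iyer's serum data set and Ferea's yeast data set, and adjust-FOM and Entropy-FOM plots were produced for six clustering methods. Conclusion: Extensive computation suggests that the spectral clustering SOM method and the fuzzy clustering method have relatively high clustering performance.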
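The external (entropy) criterion described in this abstract can be illustrated with a small sketch. The abstract does not give the paper's exact formula, so the size-weighted average of per-cluster Shannon entropy used here is an assumption, and the function name `clustering_entropy` is hypothetical. Lower values mean that each cluster is dominated by a single known functional class.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_ids, known_classes):
    """Size-weighted mean Shannon entropy (in bits) of known classes per cluster.

    Assumed definition for illustration; not taken from the paper.
    """
    members = defaultdict(list)
    for cluster, cls in zip(cluster_ids, known_classes):
        members[cluster].append(cls)

    total = len(cluster_ids)
    weighted = 0.0
    for classes in members.values():
        size = len(classes)
        counts = Counter(classes)
        # Entropy of the class distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        weighted += (size / total) * h
    return weighted


# Two clusters that each hold a single functional class: entropy is zero.
pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
# Two clusters that each mix both classes evenly: entropy is one bit.
mixed = clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
```

Because this score only uses genes with known annotations, it complements rather than replaces an internal criterion such as the adjust-FOM mentioned in the abstract.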

2.
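Objective: To identify a risk assessment method suited to the characteristics of centralized water supply. Methods: A total of 1,302 centralized water supply facilities in China were selected and their risk indices calculated. Two-step clustering was used to determine the optimal number of classes. Hierarchical clustering and K-means clustering were then each applied to the risk indices, using both the optimal number of classes and the three classes recommended by the standard. Results: Two-step clustering indicated that the optimal number of classes was 4. Four-class hierarchical clustering placed 97.8% of the facilities in class 2, and three-class hierarchical clustering placed 98.5% of the facilities in class 1. Four-class K-means clustering assigned 14.0%, 16.0%, 30.6%, and 39.5% of the facilities to classes 1 through 4, respectively; three-class clustering assigned 26.3%, 34.2%, and 39.5% of the facilities to classes 1 through 3, respectively. Conclusion: Risk in centralized water supply is best graded into four classes (high, relatively high, relatively low, and low), and K-means clustering is the more suitable method for this field.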
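As a rough illustration of K-means applied to a single risk index per facility, here is a minimal one-dimensional sketch of Lloyd's algorithm. The function name `kmeans_1d`, the deterministic initialization from order statistics, and the sample values are all assumptions made for the example; the study's software and settings are not stated in the abstract.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on scalar values (assumes a non-empty input).

    Initial centers are evenly spaced order statistics, so the result is
    deterministic. Returns (centers, labels) with labels in input order.
    """
    data = sorted(values)
    n = len(data)
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # An empty group keeps its previous center.
        new_centers = [
            sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers

    labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in values]
    return centers, labels


# Hypothetical risk indices forming three well-separated groups.
risk_indices = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
centers, labels = kmeans_1d(risk_indices, k=3)
```

Because K-means minimizes within-class variance, it tends to spread objects across classes more evenly than hierarchical clustering does when a few extreme values exist, which is consistent with the class proportions reported above.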

3.
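Objective: To apply the extension clustering method to predicting the annual number of patients admitted to a hospital. Methods: An extension clustering prediction model was constructed to determine the class to which the sample to be clustered belongs, and the prediction was made on that basis. Results: A model built from data on related factors for 1981-1989 was used to predict the number of patients admitted in 1990, and the prediction agreed with the actual figure. Conclusion: Extension clustering is effective and feasible for predicting annual hospital admissions and offers a new method for hospital management and statistics.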

4.
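Objective: To apply factor analysis and cluster analysis to the inorganic element content of 14 commonly used antitumor Chinese herbal medicines and to explore the relationship between inorganic element content and antitumor efficacy. Methods: Factor analysis was used to build a factor model that explains the correlations among the inorganic elements and between the elements and efficacy. Cluster analysis was applied in two ways: variable clustering of the 14 antitumor medicines to verify the conclusions of the factor analysis and to show the correlations among the inorganic elements hierarchically, and sample clustering to show that medicines within a class are strongly similar. Results: The 14 antitumor medicines fell into two broad classes, heat-clearing and detoxifying medicines and medicines that "combat poison with poison", and the factor analysis and cluster analysis results agreed. Conclusion: Each Chinese medicine has its own elemental profile, reflected in differing element contents. Data mining can reasonably explain the correlation between efficacy and elements, and medicines with larger correlation coefficients also tend to be more similar in nature, flavor, and function. From the perspective of how organic constituents and their complexes contribute to efficacy and to improved bioavailability, this provides a scientific basis for quality evaluation of Chinese medicines.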

5.
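Objective: To use cluster analysis to investigate patterns of antigenic variation in the H3 subtype of influenza A virus worldwide. Methods: All H3-subtype gene sequences of influenza A virus RNA segment 4 were downloaded from NCBI GenBank and the influenza virus database. After sequence alignment in Clustalx, the sequences were analyzed by two-step clustering, and the distribution of each cluster across time, place, and population (the "three distributions") was then examined. Results: All sequences could be divided into 10 clusters, 7 of which consisted mainly of human influenza viruses. Human influenza viruses were clearly separated from avian and other mammalian influenza viruses but shared several clusters with swine influenza viruses. The clusters showed clear temporal and host distribution patterns, but no geographic pattern was found. Conclusion: Under selective pressure from the human immune system, the H3 antigen shows a pattern of major variation every 5-7 years, and this trend has accelerated with the widespread use of influenza vaccines over the past decade. In addition, swine and human influenza viruses appear in the same clusters and are genetically close, which provides new evidence for pigs acting as a vessel for viral reassortment.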

6.
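Comparison of principal component analysis and cluster analysis in the study of ethnic differentiation   Total citations: 10, self-citations: 0, citations by others: 10
Objective: To compare the results of classifying 13 populations with two methods, principal component analysis and cluster analysis. Methods: Both numerical classification methods were applied to biallelic frequency data for 12 Y-chromosome haplotypes in order to classify 13 populations, including ethnic Koreans, to make between-population comparisons, and to clarify ethnic origins. Results: The two methods gave somewhat different results. Principal component analysis can reduce the influence of irrelevant variables, but information may be lost while the data are simplified and their dimensionality reduced. Cluster analysis makes full use of the original data but cannot exclude the "noise" introduced by irrelevant variables. Conclusion: Both principal component analysis and cluster analysis are suitable for classifying complex multidimensional data, but in practice the results of both should be combined with domain knowledge to reach an objective and reasonable conclusion.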

7.
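Objective: To explore the ability of a data mining method based on the self-organizing map (SOM) to discover knowledge in historically accumulated clinical laboratory test data, and to attempt a standardized representation of knowledge in this domain using an ontology approach. Methods: The basic idea was to use SOM neural networks to mine, from accumulated data, the patterns (knowledge) by which clinical tests are ordered on the basis of expert experience. Visit records of internal medicine outpatients at two general hospitals in Xi'an from 2009 to 2011 served as training samples, and a clustering model of outpatients was built with the SOM method. Following general ontology construction principles and methods, relevant domain concepts and their relationships were extracted to build an ontology framework for the clinical laboratory testing domain. Results: The SOM network grouped patients into 8 clusters, each with fairly clear clinical meaning; that is, patients within a cluster showed similar patterns of clinical laboratory test use. The clustering covered 69.73% of patients. Sex, age, the cumulative number of clinical tests over 3 years, and disease characteristics contributed most to the clustering model. An ontology framework for the clinical laboratory testing domain was constructed; its construction followed ontology building criteria, and its concepts and relationships are relatively complete. Conclusion: The SOM method clustered outpatients well, indicating that it is feasible to use data mining to discover patterns of clinical laboratory test use from historical data. Representing knowledge in this domain in a standardized way with an ontology approach is a worthwhile attempt.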

8.
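Objective: To determine 26 elements in 90 samples of infant formula milk powder and to analyze their mineral content, providing a basis for informed consumer purchasing. Methods: Inductively coupled plasma mass spectrometry (ICP-MS) on an Agilent 7900, in helium mode, was used to determine 26 elements in infant formula milk powder, and correlation and cluster analyses were performed on the 9 mineral elements declared on the packaging. Results: Cluster analysis of the 9 mineral elements showed that the samples fell into 3 clusters, and these 3 clusters were strongly positively correlated with formula stage. Conclusion: Mineral element content in the 90 infant formula samples was broadly similar across brands, places of origin, and domestic versus imported products. The samples clustered into 3 groups that were unrelated to brand, origin, or import status but were related to formula stage.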

9.
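Objective: To study how certain clustering procedures in hierarchical clustering interfere with the clustering result, and to find a clustering procedure that removes this interference. Methods: Using the maximum-tree clustering method from graph theory and fuzzy mathematics as the standard, different clustering procedures were analyzed to identify why certain procedures in hierarchical clustering seriously affect the clustering result. Results: A unified clustering procedure (for variables or samples) is given that removes the serious effect of certain hierarchical clustering procedures on the clustering result. Conclusion: The unified clustering procedure (for variables or samples) removes the serious effect of certain hierarchical clustering procedures on the clustering result; it not only retains the advantages of the hierarchical clustering procedure but can also uncover useful information hidden in the original data.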

10.
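Classification of tissue samples using gene expression profiles   Total citations: 5, self-citations: 0, citations by others: 5
Objective: To examine, in the analysis of gene expression profile data, how well tissue samples are classified when principal component analysis is combined with two classification methods, hierarchical clustering and K-means clustering. Methods: Clustering after dimension reduction by principal component analysis was compared with direct clustering without principal component analysis, across various screening thresholds for selecting genes related to tissue sample type, and clustering performance was evaluated. Results: The two clustering methods were evaluated with the Youden index. Clustering on the extracted principal components performed differently from direct clustering without principal component analysis, and the threshold used to select related genes also affected clustering performance. Conclusion: When clustering tissue samples, principal component analysis can improve clustering quality, and a sound method for selecting differentially expressed genes can improve clustering performance.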
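The Youden index named in this abstract is sensitivity plus specificity minus one. The sketch below assumes a two-class setting in which each cluster has already been mapped to a tissue type; the abstract does not say how that mapping was done, and the function name `youden_index` and the sample labels are hypothetical.

```python
def youden_index(true_labels, predicted_labels, positive):
    """Youden index J = sensitivity + specificity - 1 for a two-class result."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in true_labels")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


# Hypothetical example: 4 tumor and 4 normal samples, one error in each class.
truth = ["tumor"] * 4 + ["normal"] * 4
assigned = ["tumor", "tumor", "tumor", "normal",
            "normal", "normal", "normal", "tumor"]
j = youden_index(truth, assigned, positive="tumor")
```

A value of 1 means the clusters reproduce the known tissue types exactly, and a value near 0 means the clustering is no better than chance at separating them.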

11.
Genome‐wide association studies (GWAS) have become a very effective research tool to identify genetic variants underlying various complex diseases. In spite of the success of GWAS in identifying thousands of reproducible associations between genetic variants and complex disease, in general, the association between genetic variants and a single phenotype is usually weak. It is increasingly recognized that joint analysis of multiple phenotypes can be potentially more powerful than the univariate analysis, and can shed new light on underlying biological mechanisms of complex diseases. In this paper, we develop a novel variable reduction method using a hierarchical clustering method (HCM) for joint analysis of multiple phenotypes in association studies. The proposed method involves two steps. The first step applies a dimension reduction technique by using a representative phenotype for each cluster of phenotypes. Then, existing methods are used in the second step to test the association between genetic variants and the representative phenotypes rather than the individual phenotypes. We perform extensive simulation studies to compare the powers of multivariate analysis of variance (MANOVA), joint model of multiple phenotypes (MultiPhen), and trait‐based association test that uses extended Simes procedure (TATES) with HCM against their powers without HCM. Our simulation studies show that using HCM is more powerful than not using it in most scenarios. We also illustrate the usefulness of HCM by analyzing whole‐genome genotyping data from a lung function study.

12.
Predicting a phenotype and understanding which variables improve that prediction are two very challenging and overlapping problems in the analysis of high‐dimensional (HD) data such as those arising from genomic and brain imaging studies. It is often believed that the number of truly important predictors is small relative to the total number of variables, making computational approaches to variable selection and dimension reduction extremely important. To reduce dimensionality, commonly used two‐step methods first cluster the data in some way, and build models using cluster summaries to predict the phenotype. It is known that important exposure variables can alter correlation patterns between clusters of HD variables, that is, alter network properties of the variables. However, it is not well understood whether such altered clustering is informative in prediction. Here, assuming there is a binary exposure with such network‐altering effects, we explore whether the use of exposure‐dependent clustering relationships in dimension reduction can improve predictive modeling in a two‐step framework. Hence, we propose a modeling framework called ECLUST to test this hypothesis, and evaluate its performance through extensive simulations. With ECLUST, we found improved prediction and variable selection performance compared to methods that do not consider the environment in the clustering step, or to methods that use the original data as features. We further illustrate this modeling framework through the analysis of three data sets from very different fields, each with HD data, a binary exposure, and a phenotype of interest. Our method is available in the eclust CRAN package.

13.
We consider the problem of model‐based clustering in the presence of many correlated, mixed continuous, and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (1) cluster these patients into similar groups to help identify those with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably with other methods via simulation of problems of this type. The results of the autism spectrum disorder analysis suggested 3 clusters to be most likely, while only 4 test scores had high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.

14.
The production of increasingly reliable and accessible gene expression data has stimulated the development of computational tools to interpret such data and to organize them efficiently. Clustering techniques are widely recognized as useful exploratory tools for gene expression data analysis. Genes that show similar expression patterns over a wide range of experimental conditions can be clustered together. This relies on the hypothesis that genes that belong to the same cluster are coregulated and involved in related functions. Nevertheless, clustering algorithms still show limits, particularly for the estimation of the number of clusters and the interpretation of hierarchical dendrograms, which may significantly influence the outputs of the analysis process. We propose here a multilevel SOM-based clustering algorithm named Multi-SOM. Through the use of clustering validity indices, Multi-SOM overcomes the problem of estimating the number of clusters. To test the validity of the proposed clustering algorithm, we first tested it on supervised training data sets. Results were evaluated by computing the number of misclassified samples. We then used Multi-SOM for the analysis of macrophage gene expression data generated in vitro from the same individual's blood infected with 5 different pathogens. This analysis led to the identification of sets of tightly coregulated genes across different pathogens. Gene Ontology tools were then used to estimate the biological significance of the clustering, which showed that the obtained clusters are coherent and biologically significant.

15.
Clustered data are often encountered in biomedical studies, and to date, a number of approaches have been proposed to analyze such data. However, the phenomenon of informative cluster size (ICS) is a challenging problem, and its presence has an impact on the choice of a correct analysis methodology. For example, Dutta and Datta (2015, Biometrics) presented a number of marginal distributions that could be tested. Depending on the nature and degree of informativeness of the cluster size, these marginal distributions may differ, as do the choices of the appropriate test. In particular, they applied their new test to a periodontal data set where the plausibility of the informativeness was mentioned, but no formal test for the same was conducted. We propose bootstrap tests for testing the presence of ICS. A balanced bootstrap method is developed to successfully estimate the null distribution by merging the re‐sampled observations with closely matching counterparts. Relying on the assumption of exchangeability within clusters, the proposed procedure performs well in simulations even with a small number of clusters, at different distributions and against different alternative hypotheses, thus making it an omnibus test. We also explain how to extend the ICS test to a regression setting, thereby enhancing its practical utility. The methodologies are illustrated using the periodontal data set mentioned earlier. Copyright © 2017 John Wiley & Sons, Ltd.

16.
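Objective: To analyze the expression characteristics of long non-coding RNAs (lncRNAs) in ulcerative colitis (UC) using microarrays and bioinformatics. Methods: Three Uyghur patients with hospital-confirmed UC formed the observation group, and three healthy Uyghur individuals undergoing routine physical examination formed the control group. An lncRNA microarray was used to profile lncRNA and messenger RNA (mRNA) expression in colon tissue from both groups, and lncRNAs differentially expressed between the groups were analyzed. Online bioinformatics software was used for genome-wide association analysis of the biological functions of the differentially expressed lncRNAs and of their predicted target mRNAs. Results: In preprocessing of the microarray results, overall gene expression was consistent between the two groups. Cluster analysis of the lncRNA and mRNA data showed clear differences in lncRNA and mRNA expression between the two groups. With a fold-change threshold of ≥2, 1,242 lncRNAs were differentially expressed in diseased colon tissue from the observation group, of which 579 were upregulated and 663 were downregulated. The differentially expressed lncRNAs were significantly enriched in functions such as immune system processes, inflammatory pathways, and B-cell activation. Locus-based prediction identified, among the target genes of the differentially expressed lncRNAs, 8 lncRNA-mRNA regulatory mechanisms related to UC pathogenesis. Conclusion: lncRNAs are differentially expressed between the colon tissue of Uyghur patients with UC in Xinjiang and that of healthy Uyghur individuals, and these lncRNAs may be involved in the regulation of UC pathogenesis.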

17.
When a large number of genes are significant in correlating microarray gene expression data with patient prognosis, clustering of significant genes may be effective not only for further dimension reduction but also for identifying co-regulated genes that belong to the same molecular pathway related to disease biology and aggressiveness. Moreover, a reduced feature, such as the average expression across samples for a cluster of significant genes, can play an important role in reducing variance in prediction analysis. We propose a simple procedure to select gene clusters that have strong marginal association with survival outcome from a large pool of candidate hierarchical clusters of significant genes. Selected gene clusters can have better predictive capability than the other gene clusters and singleton genes. Application of such clustering to the data set from a clinical study for patients with multiple myeloma and associated microarrays is given.

18.
In this paper, we propose methods to cluster groups of two‐dimensional data whose mean functions are piecewise linear into several clusters with common characteristics such as the same slopes. To fit segmented line regression models with common features for each possible cluster, we use a restricted least squares method. In implementing the restricted least squares method, we estimate the maximum number of segments in each cluster by using both the permutation test method and the Bayes information criterion method and then propose to use the Bayes information criterion to determine the number of clusters. For a more effective implementation of the clustering algorithm, we propose a measure of the minimum distance worth detecting and illustrate its use in two examples. We summarize simulation results to study properties of the proposed methods and also prove the consistency of the cluster grouping estimated with a given number of clusters. The presentation and examples in this paper focus on the segmented line regression model with the ordered values of the independent variable, which has been the model of interest in cancer trend analysis, but the proposed method can be applied to a general model with design points either ordered or unordered. Copyright © 2014 John Wiley & Sons, Ltd.

19.
Although gene‐environment (G×E) interactions play an important role in many biological systems, detecting these interactions within genome‐wide data can be challenging due to the loss in statistical power incurred by multiple hypothesis correction. To address the challenge of poor power and the limitations of existing multistage methods, we recently developed a screening‐testing approach for G×E interaction detection that combines elastic net penalized regression with joint estimation to support a single omnibus test for the presence of G×E interactions. In our original work on this technique, however, we did not assess type I error control or power and evaluated the method using just a single, small bladder cancer data set. In this paper, we extend the original method in two important directions and provide a more rigorous performance evaluation. First, we introduce a hierarchical false discovery rate approach to formally assess the significance of individual G×E interactions. Second, to support the analysis of truly genome‐wide data sets, we incorporate a score statistic‐based prescreening step to reduce the number of single nucleotide polymorphisms prior to fitting the first stage penalized regression model. To assess the statistical properties of our method, we compare the type I error rate and statistical power of our approach with competing techniques using both simple simulation designs as well as designs based on real disease architectures. Finally, we demonstrate the ability of our approach to identify biologically plausible SNP‐education interactions relative to Alzheimer's disease status using genome‐wide association study data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
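For readers unfamiliar with false discovery rate control, the flat Benjamini-Hochberg step-up procedure is sketched below. This is not the hierarchical FDR approach the abstract describes, whose details are not given here; it is the simpler, standard procedure shown only to make the idea concrete. The function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flat Benjamini-Hochberg step-up procedure.

    Returns a list of booleans (in input order) marking rejected hypotheses.
    Illustrative stand-in; not the hierarchical procedure from the paper.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value is at or below alpha * rank / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank

    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four candidate interactions.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], alpha=0.05)
```

A hierarchical variant would apply a test of this kind first to groups of hypotheses and then within the groups that pass, which is one way to recover power when many candidate interactions are screened.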

20.
Objectives: In some developing countries, despite advancements in Information Technology (IT), medical resources are scarce; hence, introduction of telemedicine services can solve this problem. In this study, we examined the possibility of introducing telemedicine-based services in developing countries utilizing the available data. Methods: In Asia, the study was conducted in nine developing countries, excluding those where data were unavailable. In Africa, thirteen countries whose per capita Gross Domestic Product (GDP) was less than USD 1000, and where data were unavailable, were also excluded. We chose the number of doctors, nurses, and midwives as indicators of the healthcare environment. We used the number of internet contracts and mobile phone contracts as indicators of IT penetration, and set per capita GDP and its growth rate as economic indicators. We combined the two continents' data and performed a principal component analysis (PCA) and cluster analysis. Results: We used cluster analysis to classify the target countries into the following five clusters: Cluster A: Algeria, Egypt, Morocco, Indonesia, Ghana, Tunisia, Madagascar, Nigeria, and Thailand; Cluster B: Bangladesh, Ethiopia, Kenya, Uganda, India, and Pakistan; Cluster C: Sudan, Malaysia, Vietnam, Tanzania, Philippines, and China; Cluster D: South Africa; and Cluster E: Japan and Singapore. As a result of conducting PCA, Cluster A emerged as the region with the highest progressiveness and development possibility. Conclusions: Introduction of telemedicine services has been visualized by using cluster analysis and PCA. However, it is necessary to incorporate future medical needs as indicators to make a more appropriate assessment of its potential.
