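Random forest dimension reduction analysis of high-dimensional DNA methylation data
Cite this article: Zhang Qiuyi, Zhao Yang, Wei Yongyue, Zhang Ruyang, Chen Feng. Random forest dimension reduction analysis of high-dimensional DNA methylation data [J]. Chinese Journal of Disease Control & Prevention (中华疾病控制杂志), 2016, 20(6): 630-633.
Authors: Zhang Qiuyi (张秋伊), Zhao Yang (赵杨), Wei Yongyue (魏永越), Zhang Ruyang (张汝阳), Chen Feng (陈峰)
Affiliation: Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing 211166, Jiangsu, China
Funding: National Natural Science Foundation of China (81530088, 81473070, 81373102, 61301251, 81402764); Priority Academic Discipline Development Program of Jiangsu Higher Education Institutions (2014); Natural Science Project of Jiangsu Higher Education Institutions (12KJB310003); Jiangsu Qing Lan Project (2014)
Abstract: Objective To apply the random forest algorithm to high-dimensional methylation data from a case-control study of rheumatoid arthritis and to evaluate its performance. Methods The example data came from the Gene Expression Omnibus (GEO) database, accession number GSE42861, and comprised 354 cases and 335 controls. Chromosome 9, which contains the gene region associated with rheumatoid arthritis, was selected, giving a total of 2,433 cytosine-phosphate-guanine dinucleotide (CpG) sites. Random forest was used to compute variable importance scores and rank the variables; a stepwise random forest procedure was then run on the ranked variables to find the subset most likely to be associated with the outcome; finally, stepwise logistic regression was applied to the reduced subset. Results Stepwise random forest selected 80 important CpG sites, of which 13 were statistically significant in the logistic regression model. A logistic regression model built on these sites achieved a prediction accuracy of 88.29%. Conclusions The random forest algorithm can greatly reduce the number of noise variables and improve statistical power, and is suitable for the analysis of high-dimensional methylation data.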

Keywords: Arthritis, rheumatoid; DNA methylation; Epidemiologic methods
Received: 2015-12-26

The application of random forest for high dimensional DNA methylation data
Institution:Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing 211166, China
Abstract: Objective To study the application of the random forest algorithm to high-dimensional DNA methylation data from a case-control study of rheumatoid arthritis (RA). Methods The RA dataset was obtained from the Gene Expression Omnibus (GEO) data repository (accession number GSE42861) and contained 689 samples (354 patients and 335 controls). A total of 2,433 cytosine-phosphate-guanine (CpG) sites on chromosome 9 were included because the identified RA-associated region is located on this chromosome. First, the variables were ranked by importance scores calculated with random forest. Second, stepwise random forest was carried out to find the subset of variables most likely to be associated with the outcome. Third, stepwise logistic regression was conducted on this subset. Results Eighty important CpG sites were selected by stepwise random forest. In the logistic regression model, 13 CpG sites were statistically significant. The model containing these 13 CpG sites had a prediction accuracy of 88.29%. Conclusions The random forest algorithm can greatly reduce the number of noise variables and is applicable to high-dimensional DNA methylation data.
Keywords: Arthritis, rheumatoid; DNA methylation; Epidemiologic methods
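The first two steps described in the abstract (ranking variables by random forest importance, then a stepwise random forest over the ranked variables) might be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which software, importance measure, step size, or stopping rule was used. The sketch assumes scikit-learn's impurity-based `feature_importances_`, a forward pass that adds ranked variables in blocks of 5, and selection by lowest out-of-bag (OOB) error. The function names and the synthetic data are hypothetical; the data are not GSE42861.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_by_importance(X, y, n_trees=100, seed=0):
    """Return column indices sorted by decreasing random forest importance.

    Assumption: impurity-based importance; the paper may have used another measure.
    """
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1]


def stepwise_random_forest(X, y, ranking, step=5, n_trees=100, seed=0):
    """Refit forests on the top-k ranked variables for growing k.

    Returns the top-k subset with the lowest OOB error and the error for each k.
    Assumption: forward growth in fixed blocks; the paper's rule is not stated.
    """
    best_k, best_err = None, np.inf
    errors = {}
    for k in range(step, len(ranking) + 1, step):
        rf = RandomForestClassifier(
            n_estimators=n_trees, oob_score=True, random_state=seed
        )
        rf.fit(X[:, ranking[:k]], y)
        err = 1.0 - rf.oob_score_  # OOB misclassification rate
        errors[k] = err
        if err < best_err:
            best_k, best_err = k, err
    return ranking[:best_k], errors


# Synthetic stand-in for methylation beta values in [0, 1]:
# 300 samples, 60 sites, only the first 5 sites carry signal.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 60))
y = (X[:, :5].sum(axis=1) + rng.normal(0.0, 0.3, size=300) > 2.5).astype(int)

ranking = rank_by_importance(X, y)
selected, errors = stepwise_random_forest(X, y, ranking)
print("number of selected sites:", len(selected))
```

The selected columns would then be passed to a stepwise logistic regression, the third step in the abstract. scikit-learn has no built-in stepwise logistic regression, so that step is left out of this sketch.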
This article is indexed in Wanfang Data (万方数据) and other databases.