
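Parallel computing method for Random Forest and its applicable conditions
Cite this article: GU Xingbo, WEN Qi, SHI Xiaowen, LIU Yan. Parallel computing method for Random Forest and its applicable conditions[J]. Practical Preventive Medicine (实用预防医学), 2016, 23(2): 129-132
Authors: GU Xingbo (顾星博), WEN Qi (温琪), SHI Xiaowen (史晓雯), LIU Yan (刘艳)
Affiliation: Department of Health Statistics, Harbin Medical University, Harbin 150081, Heilongjiang
Funding: National Natural Science Foundation of China (81172741, 30972537)
Abstract: Objective: To explore how parallel computation of Random Forest can be implemented and under what conditions it is applicable, providing a scientific reference for the analysis of genomic data. Methods: A parallel Random Forest program was written with the R foreach package, and its behavior was examined on simulated SNP data. Results: With 100, 500 and 1000 SNP loci, the speedup of the parallel method grew non-linearly as the workstation used more CPUs, and for the same number of loci the speedup also differed with ntree. With 5000 SNP loci the speedup was poor: on 10 cores there was almost no speedup at ntree = 500 and ntree = 1000, and even at ntree = 5000 or 10000 the speedup did not exceed 2-fold. Conclusion: The parallel Random Forest method based on the R foreach package gives an acceptable speedup when the number of SNP loci is modest (e.g. < 1000). Because of communication overhead arising from shared memory and similar factors, the speedup is very poor when the number of SNP loci is large (over 5000); in that case other analysis tools, such as Random Jungle (RJ), may be considered.

Keywords: big data; Random Forest; parallel computing; single nucleotide polymorphism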
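
The abstract does not reproduce the program itself. As a rough illustration of the split-and-combine idea behind growing a forest in parallel (divide ntree among workers, grow a sub-forest on each, then concatenate the results), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the authors worked in R with foreach, whereas this sketch uses only the Python standard library, decision stumps instead of full trees, a toy SNP simulator, and hypothetical names such as `split_ntree` and `grow_forest_parallel`. Threads are used so that the sketch runs anywhere; they do not speed up pure-Python work, so a process pool would be the closer analogue of several R workers.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def simulate_snps(n_samples, n_snps, seed=0):
    """Toy data (assumption): genotypes coded 0/1/2, phenotype driven by SNP 0."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(n_snps)] for _ in range(n_samples)]
    y = [1 if row[0] >= 1 else 0 for row in X]
    return X, y


def majority(labels, default=0):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y, rng, mtry):
    """One-split 'tree' on a bootstrap sample, searching mtry random SNPs."""
    n = len(X)
    rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
    best = None
    for f in rng.sample(range(len(X[0])), mtry):
        for thr in (0, 1):  # split: genotype <= thr versus > thr
            left = [y[i] for i in rows if X[i][f] <= thr]
            right = [y[i] for i in rows if X[i][f] > thr]
            l_pred, r_pred = majority(left), majority(right)
            correct = left.count(l_pred) + right.count(r_pred)
            if best is None or correct > best[0]:
                best = (correct, f, thr, l_pred, r_pred)
    return best[1:]  # (feature, threshold, left label, right label)


def split_ntree(ntree, workers):
    """Divide ntree as evenly as possible among the workers."""
    base, extra = divmod(ntree, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


def grow_chunk(job):
    """Grow one sub-forest; each chunk gets its own seed for reproducibility."""
    X, y, ntree, mtry, seed = job
    rng = random.Random(seed)
    return [fit_stump(X, y, rng, mtry) for _ in range(ntree)]


def grow_forest_parallel(X, y, ntree, workers, mtry, seed=0):
    """Grow sub-forests concurrently, then combine them into one forest."""
    jobs = [(X, y, n, mtry, seed + i)
            for i, n in enumerate(split_ntree(ntree, workers))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sub_forests = list(pool.map(grow_chunk, jobs))
    return [tree for sub in sub_forests for tree in sub]  # the combine step


def predict(forest, row):
    """Majority vote over all stumps in the combined forest."""
    return majority([l if row[f] <= thr else r for f, thr, l, r in forest])
```

A possible call, with arbitrary sizes, is `X, y = simulate_snps(60, 20)` followed by `forest = grow_forest_parallel(X, y, ntree=50, workers=4, mtry=4)`. Note that every job carries the full data set to its worker; with real worker processes that transfer is one source of the kind of communication overhead the abstract mentions.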

Parallel Random Forest method and applicable condition
GU Xingbo, WEN Qi, SHI Xiaowen, LIU Yan. Parallel Random Forest method and applicable condition[J]. Practical Preventive Medicine, 2016, 23(2): 129-132
Authors: GU Xingbo, WEN Qi, SHI Xiaowen, LIU Yan
Affiliation: Department of Biostatistics, School of Public Health, Harbin Medical University, Harbin 150081, Heilongjiang, China
Abstract: Objective To explore how to implement a parallel Random Forest and the conditions under which it is applicable, and to provide a scientific reference for genomics data analysis. Methods A parallel Random Forest program was written with the R foreach package, and simulated SNP data were used to evaluate its performance. Results When the number of SNPs was 100, 500 or 1000, performance gains were not linear in the number of CPUs, and for the same amount of data the gains also differed with the value of ntree. When the number of SNPs reached 5000, the gains were low: in the 10-CPU environment there was almost no speed gain when ntree was 500 or 1000, and even when ntree was 5000 or 10000 the program was less than 2 times faster than the sequential job. Conclusion When the number of SNPs is not large (less than 1000), the parallel Random Forest program based on the R foreach package performs acceptably. However, when the number of SNPs is high (over 5000), communication overhead from shared memory and similar factors makes the method perform poorly, and other analysis tools such as Random Jungle (RJ) can be considered.
Keywords:Big Data  Random Forest  Parallel Computation  SNPs  
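
To put the reported figures in context, speedup is conventionally the sequential run time divided by the parallel run time, and parallel efficiency is the speedup divided by the number of CPUs. The short Python sketch below encodes those two standard definitions. The function names and the timings in the usage lines are hypothetical, not measurements from the paper.

```python
def speedup(t_sequential, t_parallel):
    """Speedup = sequential run time / parallel run time."""
    return t_sequential / t_parallel


def efficiency(speedup_value, n_cpus):
    """Parallel efficiency = speedup / number of CPUs (1.0 would be ideal)."""
    return speedup_value / n_cpus


# Hypothetical timings in seconds, chosen only to give a speedup of 2.
s = speedup(120.0, 60.0)
print(s)                   # 2.0
print(efficiency(s, 10))   # 0.2
```

Read this way, the abstract's upper bound for 5000 SNPs (a speedup of at most 2 on 10 cores) corresponds to a parallel efficiency of at most 0.2.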
This article is indexed in CNKI and other databases.