
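Variable selection methods based on random forest variable importance scores and their application in the diagnosis of tumor subtypes
Cite this article: WANG Wen-jie, MA Jin-sha, GAO Qian, WANG Tong. Variable selection methods based on random forest variable importance scores and their application in the diagnosis of tumor subtypes[J]. Chinese Journal of Disease Control & Prevention (中华疾病控制杂志), 2023, 27(3): 274-280. doi: 10.16462/j.cnki.zhjbkz.2023.03.005
Authors: WANG Wen-jie (王文杰), MA Jin-sha (马金沙), GAO Qian (高倩), WANG Tong (王彤)
Affiliation: Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan 030001, China
Funding: National Natural Science Foundation of China (81872715); National Natural Science Foundation of China (82073674); Shanxi Province Science and Technology Major Special Project (202005D121008); Shanxi Province Key Research and Development Program (202102130501003)
Abstract: Objective  To investigate variable selection methods based on random forest (RF) variable importance scores for high-dimensional omics data with a binary outcome, and to choose suitable methods for building an outcome prediction model. Methods  First, according to the variable selection objective, a simulation study compared how well minimal-optimal variable selection RF algorithms [recursive feature elimination (RFE)-RF, biosigner] and all-relevant variable selection RF algorithms (Boruta, vita, altmann, r2vim) identify important variables in high-dimensional data. The strengths of the different methods were then combined to select genes related to diffuse large B-cell lymphoma (DLBCL) subtyping and to build a DLBCL subtype diagnostic model. Results  The simulation study showed that vita had relatively high sensitivity and biosigner had relatively high positive predictive value. In the case analysis, vita selected 1 019 genes related to DLBCL subtyping, from which biosigner subsequently selected 77 genes. The area under the receiver operating characteristic (ROC) curve (AUC) of the resulting DLBCL subtype diagnostic model was 0.910. Conclusions  vita and biosigner can be used for the preliminary and final stages, respectively, of selecting genes related to DLBCL subtyping. The model built from the finally selected genes can effectively diagnose DLBCL subtypes.

Keywords: Random forest; Variable selection; Diffuse large B-cell lymphoma
Received: 2022-02-18
Revised: 2022-05-23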

Variable selection methods based on variable importance measurement from random forest and its application in diagnosis of tumor typing
WANG Wen-jie, MA Jin-sha, GAO Qian, WANG Tong. Variable selection methods based on variable importance measurement from random forest and its application in diagnosis of tumor typing[J]. CHINESE JOURNAL OF DISEASE CONTROL & PREVENTION, 2023, 27(3): 274-280. doi: 10.16462/j.cnki.zhjbkz.2023.03.005
Authors: WANG Wen-jie, MA Jin-sha, GAO Qian, WANG Tong
Affiliation: Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan 030001, China
Abstract:  Objective  To explore variable selection methods based on random forest (RF) variable importance measures for binary outcomes in high-dimensional omics data, and to choose appropriate methods for constructing an outcome prediction model.  Methods  First, according to the variable selection objective, we used simulation to compare the ability of minimal-optimal variable selection RF methods [recursive feature elimination (RFE)-RF, biosigner] and all-relevant variable selection RF methods (Boruta, vita, altmann and r2vim) to identify important variables in high-dimensional data. We then combined the strengths of different methods to select genes related to diffuse large B-cell lymphoma (DLBCL) classification and constructed a diagnostic model for DLBCL classification.  Results  The simulation study showed that vita had higher sensitivity and biosigner had higher positive predictive value. In the empirical study, the vita method identified 1 019 genes related to DLBCL classification, and the biosigner method subsequently reduced these to 77 genes. The area under the receiver operating characteristic (ROC) curve (AUC) of the DLBCL typing diagnostic model was 0.910.  Conclusions  The vita and biosigner methods can be used in the preliminary and final stages, respectively, of selecting genes related to DLBCL classification. The model built from the finally selected genes can effectively distinguish the subtypes of DLBCL.
Keywords: Random forest; Variable selection; Diffuse large B-cell lymphoma
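
The abstract compares selection methods by sensitivity and positive predictive value (PPV) in a simulation where the truly important variables are known. The abstract does not give the formulas, so the following is a minimal sketch that assumes the usual definitions: sensitivity is the share of truly important variables that were selected, and PPV is the share of selected variables that are truly important. The function name `selection_metrics` and the toy gene labels are hypothetical and are not taken from the paper.

```python
import math


def selection_metrics(selected, truly_important):
    """Return (sensitivity, ppv) for one variable selection result.

    Assumed definitions (not stated in the abstract):
      sensitivity = |selected AND truly important| / |truly important|
      ppv         = |selected AND truly important| / |selected|
    An empty denominator yields NaN rather than raising an error.
    """
    selected = set(selected)
    truly_important = set(truly_important)
    true_positives = len(selected & truly_important)
    sensitivity = (
        true_positives / len(truly_important) if truly_important else math.nan
    )
    ppv = true_positives / len(selected) if selected else math.nan
    return sensitivity, ppv


# Hypothetical toy data: four truly important variables.
truth = {"g1", "g2", "g3", "g4"}
# A liberal selection finds all of them but adds four false positives.
liberal = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
# A strict selection returns only two variables, both truly important.
strict = {"g1", "g2"}

print(selection_metrics(liberal, truth))  # (1.0, 0.5)
print(selection_metrics(strict, truth))   # (0.5, 1.0)
```

The toy numbers only illustrate the trade-off between the two metrics: a liberal selection favors sensitivity and a strict one favors PPV, which is one way to read the two-stage use of a high-sensitivity method followed by a high-PPV method described in the conclusions.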