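A Study on Predicting Tumor Risk in the Adult Population Based on Support Vector Machine and XGboost
Cite this article: MA Qianqian, SUN Dongxu, SHI Jinming, HE Xianying, ZHAI Yunkai. A Study on Predicting Tumor Risk in the Adult Population Based on Support Vector Machine and XGboost[J]. Chinese General Practice, 2020, 23(12): 1486-1491. DOI: 10.12114/j.issn.1007-9572.2020.00.066
Authors: MA Qianqian (马倩倩)  SUN Dongxu (孙东旭)  SHI Jinming (石金铭)  HE Xianying (何贤英)  ZHAI Yunkai (翟运开)
Affiliations: 1. The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, Henan Province, China 2. National Engineering Laboratory for Internet Medical Systems and Applications, Zhengzhou 450052, Henan Province, China 3. School of Management Engineering, Zhengzhou University, Zhengzhou 450001, Henan Province, China
*Corresponding author: ZHAI Yunkai, Professor, doctoral supervisor; E-mail: zhaiyunkai@zzu.edu.cn
Funding: Support Program for University Science and Technology Innovation Teams of Henan Province (20IRTSTHN028); National Key Research and Development Program of China (2017YFC0909900); Major Science and Technology Project of Henan Province (151100310800).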
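Abstract: Background Tumor risk prediction matters greatly for improving population health and reducing the economic burden on patients. With the emergence of medical big data, however, traditional statistical prediction methods increasingly fall short, and it is necessary to explore new methods such as machine learning in tumor prediction. Objective To assess the value of support vector machine (SVM), XGboost, and stepwise Logistic regression in predicting tumor risk in adults. Methods The study period was 2011-2015. Data came from the China Health and Nutrition Survey (CHNS) and covered adult (≥18 years) permanent urban and rural residents in 12 regions of China (Heilongjiang, Liaoning, Hunan, Shandong, Guizhou, Jiangsu, Guangxi, Hubei, Henan, Beijing, Shanghai, and Chongqing); after data cleaning, 19 410 people were included. The subjects were divided into a training set and a test set at a ratio of 2:1. Using the variables selected by stepwise Logistic regression, stepwise Logistic regression, SVM, and XGboost models for predicting tumor risk were built on the training set and validated on the test set. Model performance was compared using the area under the receiver operating characteristic (ROC) curve (AUC). Results Of the 19 410 subjects, 262 (1.35%) had been diagnosed with a tumor. The training set (n=12 919) contained 174 tumor patients and the test set (n=6 491) contained 88. On the test set, the accuracy of stepwise Logistic regression, SVM, and XGboost was 72.96% [95%CI (71.86%, 74.04%)], 99.54% [95%CI (99.34%, 99.69%)], and 70.05% [95%CI (68.92%, 71.16%)], respectively, and the AUC was 76.75% [95%CI (72.35%, 81.14%)], 86.32% [95%CI (81.64%, 91.00%)], and 79.03% [95%CI (74.96%, 83.10%)], respectively. The AUCs of SVM and XGboost each differed significantly from that of the Logistic regression model (Z=-2.519 and -2.138; P=0.012 and 0.032, respectively); the AUC of XGboost was lower than that of SVM, and the difference was statistically significant (Z=2.081, P=0.037). Conclusion Compared with stepwise Logistic regression, SVM performed better on accuracy, sensitivity, specificity, and AUC in predicting tumor risk in adults, whereas XGboost showed no clear advantage. Given the ease of use and interpretability of stepwise Logistic regression, combining SVM with stepwise Logistic regression is recommended for tumor risk prediction.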

Keywords: Tumor  Health status  Stepwise Logistic regression  Support vector machine  XGboost  Prediction
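The 2:1 division into training and test sets can be sketched as follows. This is a minimal illustration only: the abstract does not say how the split was drawn, so the stratification by outcome, the `stratified_split` helper, and the random seed are all assumptions, and the labels are synthetic placeholders that match only the reported totals (262 cases among 19 410 subjects). The resulting counts are close to, but not identical with, the reported 12 919 / 6 491 split.

```python
import random

def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split indices into train/test, sampling each outcome class separately.

    Hypothetical helper: stratification is an assumption, not the paper's
    documented procedure.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)  # size of the training share
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels: 1 = tumor diagnosis, 0 = no diagnosis.
labels = [1] * 262 + [0] * (19410 - 262)

train_idx, test_idx = stratified_split(labels)
train_cases = sum(labels[i] for i in train_idx)
test_cases = sum(labels[i] for i in test_idx)
print(len(train_idx), train_cases, len(test_idx), test_cases)
```

Stratifying keeps the roughly 1.35% case rate similar in both sets, which matters when the outcome is this rare.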

Risk Prediction of Cancer in Adult Population Based on Support Vector Machine versus XGboost
MA Qianqian, SUN Dongxu, SHI Jinming, HE Xianying, ZHAI Yunkai. Risk Prediction of Cancer in Adult Population Based on Support Vector Machine versus XGboost[J]. Chinese General Practice, 2020, 23(12): 1486-1491. DOI: 10.12114/j.issn.1007-9572.2020.00.066
Authors: MA Qianqian  SUN Dongxu  SHI Jinming  HE Xianying  ZHAI Yunkai
Affiliation: 1. The First Affiliated Hospital of Zhengzhou University, Zhengzhou 450052, China
2. National Engineering Laboratory for Internet Medical Systems and Applications, Zhengzhou 450052, China
3. School of Management Engineering, Zhengzhou University, Zhengzhou 450001, China
*Corresponding author: ZHAI Yunkai, Professor, Doctoral supervisor; E-mail: zhaiyunkai@zzu.edu.cn
Abstract: Background Risk prediction of cancer is of great importance for improving population health and reducing the economic burden on patients. However, with the development of medical big data, traditional statistical forecasting methods are increasingly unable to meet demand, and it is necessary to explore new methods such as machine learning in cancer prediction. Objective To explore the application of XGboost, support vector machine (SVM), and stepwise logistic regression (SLR) models in cancer risk prediction. Methods The data were collected from the China Health and Nutrition Survey (CHNS) in 2011 and 2015, targeting urban and rural adults (≥18 years old) in 12 regions of China (Heilongjiang, Liaoning, Hunan, Shandong, Guizhou, Jiangsu, Guangxi, Hubei, Henan, Beijing, Shanghai, and Chongqing). After data cleaning, 19410 subjects were included in the study and divided into a training set and a test set at a ratio of 2:1. Based on the variables screened by stepwise regression, the SLR, SVM, and XGboost prediction models were built on the training set and validated on the test set. The area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the predictive performance of each model. Results Of the 19410 subjects, 262 (1.35%) were diagnosed with cancer; 174 of them were in the training set (n=12919) and the other 88 were in the test set (n=6491). The accuracies of the SLR, SVM, and XGboost models on the test set were 72.96% [95%CI (71.86%, 74.04%)], 99.54% [95%CI (99.34%, 99.69%)], and 70.05% [95%CI (68.92%, 71.16%)], respectively. The AUCs of the three models were 76.75% [95%CI (72.35%, 81.14%)], 86.32% [95%CI (81.64%, 91.00%)], and 79.03% [95%CI (74.96%, 83.10%)], respectively. The AUC of SLR differed significantly from that of the SVM model (Z=-2.519, P=0.012) and the XGboost model (Z=-2.138, P=0.032). The AUC of XGboost was significantly smaller than that of SVM (Z=2.081, P=0.037). Conclusion In predicting the risk of cancer in adults, the SVM model shows better accuracy, sensitivity, specificity, and AUC than the SLR model, while the XGboost model shows no clear improvement in predictive performance. Considering the advantages of SLR in ease of operation and interpretability, it is recommended to use both SVM and SLR in the prediction of cancer risk.
Keywords: Neoplasms  Health status  Stepwise Logistic regression  Support vector machine  XGboost  Forecasting
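The abstract evaluates each model by both accuracy and AUC. As a rough illustration of how the two metrics are computed from predicted risk scores, the sketch below uses the rank (Mann-Whitney) formulation of AUC and a fixed cutoff for accuracy. The function names, the 0.5 cutoff, and the toy scores are assumptions made up for this example; they are not the study's data, cutoff, or software.

```python
def auc_from_scores(y_true, scores):
    """AUC as the probability that a random case outscores a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, threshold=0.5):
    """Share of subjects classified correctly at a given score cutoff."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(y_true, predictions) if y == p)
    return correct / len(y_true)

# Toy data (made up): four non-cases followed by two cases.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

toy_auc = auc_from_scores(y_true, scores)
toy_acc = accuracy(y_true, scores)
print(toy_auc)            # 0.875
print(round(toy_acc, 4))  # 0.6667
```

AUC depends only on how the scores rank cases against non-cases, whereas accuracy also depends on the chosen cutoff, so the two metrics need not order models the same way.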
This article is indexed in the VIP (维普) database, among others.