
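Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents
Cite this article:ZHANG Le,WANG Ru-yi,YANG Hui,ZHU Su-ling.Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents[J].Modern Preventive Medicine,2023,0(7):1339-1344.
Authors:ZHANG Le  WANG Ru-yi  YANG Hui  ZHU Su-ling
Institution:School of Public Health, Lanzhou University, Lanzhou, Gansu 730000
Funding:Fundamental Research Funds for the Central Universities (lzujbky-2022-16)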
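Abstract:Objective To improve the classification and prediction of imbalanced diabetes data from middle-aged and elderly residents in China using resampling techniques. Methods Four resampling techniques, namely random undersampling, random oversampling, the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), were applied to the imbalanced diabetes data in the CHARLS database. The classification performance of logistic regression, support vector machine, and random forest was compared before and after resampling, and G-means and AUC were used to evaluate the predictive performance of the models. Results On the imbalanced CHARLS diabetes dataset, the G-means of the logistic regression, support vector machine, and random forest models were 0.2227, 0, and 0, and the AUCs were 0.7612, 0.7363, and 0.7429, respectively. The logistic regression model was significantly better than the support vector machine, and the differences in both accuracy (χ2=1 231.501, P<0.001) and AUC (Z=2.634, P=0.028) were statistically significant. The G-means of all models increased after processing with the four resampling techniques, especially SMOTE and ADASYN; in addition, random undersampling could not significantly improve the logistic regr...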

Keywords:Imbalanced classification  Resampling  Diabetes  Middle-aged and elderly residents

Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents
ZHANG Le,WANG Ru-yi,YANG Hui,ZHU Su-ling.Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents[J].Modern Preventive Medicine,2023,0(7):1339-1344.
Authors:ZHANG Le  WANG Ru-yi  YANG Hui  ZHU Su-ling
Institution:School of Public Health, Lanzhou University, Lanzhou, Gansu 730000, China
Abstract:Objective To improve the classification and prediction of imbalanced diabetes data in Chinese middle-aged and elderly residents using resampling techniques. Methods Random undersampling, random oversampling, synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN) were used to process the imbalanced diabetes data in the China Health and Retirement Longitudinal Study (CHARLS) database. The classification performance of logistic regression, support vector machine, and random forest before and after resampling was compared. The predictive performance of the models was evaluated by G-means and area under the curve (AUC). Results On the imbalanced diabetes dataset in CHARLS, the G-means of logistic regression, support vector machine, and random forest were 0.2227, 0, and 0, and the AUC values were 0.7612, 0.7363, and 0.7429, respectively. The logistic regression model was significantly better than the support vector machine in terms of accuracy (χ2=1 231.501, P<0.001) and AUC (Z=2.634, P=0.028). The G-means of all models improved after applying the four resampling techniques, especially SMOTE and ADASYN. In addition, random undersampling could not significantly improve the AUC of logistic regression (Z=3.027, P=0.003), support vector machine (Z=0.301, P=0.764), or random forest (Z=0.446, P=0.656). Random oversampling, SMOTE, and ADASYN improved the AUC of the classification models to varying degrees. Conclusion SMOTE and ADASYN can better handle the problem of imbalanced diabetes data and improve the predictive performance of diabetes classifiers.
Keywords:Imbalanced classification  Resampling  Diabetes  Middle-aged and elderly residents
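
The abstract reports G-means of 0 for two models on the unbalanced data and names random oversampling as one of the four resampling techniques. The sketch below may help make both ideas concrete. It assumes the commonly used definition of G-mean as the square root of sensitivity times specificity (the abstract does not state the formula), and it uses a made-up toy dataset of 90 negatives and 10 positives, not CHARLS data. The function names `g_mean` and `random_oversample` are hypothetical and are not taken from the paper or from any library.

```python
import math
import random
from collections import Counter


def g_mean(y_true, y_pred, positive=1):
    """G-mean, assumed here to be sqrt(sensitivity * specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data, invented for illustration: 90 negatives (0), 10 positives (1).
y_toy = [0] * 90 + [1] * 10
X_toy = [[i] for i in range(100)]

# A model that always predicts the majority class looks accurate,
# but its sensitivity is 0, so its G-mean is 0 as well.
all_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(y_toy, all_negative)) / len(y_toy)
print("accuracy:", accuracy, "G-mean:", g_mean(y_toy, all_negative))

# Random oversampling balances the class counts by duplicating minority rows.
X_bal, y_bal = random_oversample(X_toy, y_toy)
print("class counts after oversampling:", Counter(y_bal))
```

Under these assumptions, a G-mean of 0 is what one would expect when a classifier assigns every record to the majority class. SMOTE and ADASYN differ from the oversampling shown here in that they synthesize new minority samples rather than duplicating existing ones; they are not implemented in this sketch.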