Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents (重采样技术在中老年居民糖尿病不平衡数据分类中的应用)

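Cite this article:	ZHANG Le, WANG Ru-yi, YANG Hui, ZHU Su-ling. Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents [J]. Modern Preventive Medicine (现代预防医学), 2023, 0(7): 1339-1344.

Authors:	ZHANG Le (张乐), WANG Ru-yi (王如意), YANG Hui (杨慧), ZHU Su-ling (朱素玲)

Institution:	School of Public Health, Lanzhou University, Lanzhou, Gansu 730000, China

Funding:	Fundamental Research Funds for the Central Universities (lzujbky-2022-16)

Abstract:	Objective To use resampling techniques to improve the classification and prediction of imbalanced diabetes data from middle-aged and elderly residents in China. Methods Four resampling techniques, random undersampling, random oversampling, the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), were used to process the imbalanced diabetes data in the CHARLS database. The classification performance of logistic regression, support vector machine, and random forest was compared before and after resampling, and the predictive performance of the models was evaluated by G-means and AUC. Results On the imbalanced CHARLS diabetes dataset, the G-means of the logistic regression, support vector machine, and random forest models were 0.222 7, 0, and 0, and the AUC values were 0.761 2, 0.736 3, and 0.742 9, respectively. The logistic regression model was significantly better than the support vector machine, and the differences in model accuracy (χ²=1 231.501, P<0.001) and AUC (Z=2.634, P=0.028) were both statistically significant. All four resampling techniques improved the G-means of the models, especially SMOTE and ADASYN. In addition, random undersampling could not significantly improve logistic...
Keywords:	Imbalanced classification; Resampling; Diabetes; Middle-aged and elderly residents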
Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents

ZHANG Le,WANG Ru-yi,YANG Hui,ZHU Su-ling.Application of resampling technique in the classification of imbalanced diabetes data in middle-aged and elderly residents[J].Modern Preventive Medicine,2023,0(7):1339-1344.

Authors:	ZHANG Le WANG Ru-yi YANG Hui ZHU Su-ling

Institution:	School of Public Health, Lanzhou University, Lanzhou, Gansu 730000, China

Abstract:	Objective To improve the classification and prediction of imbalanced diabetes data from Chinese middle-aged and elderly residents using resampling techniques. Methods Random undersampling, random oversampling, the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN) were used to process the imbalanced diabetes data in the China Health and Retirement Longitudinal Study (CHARLS) database. The classification performance of logistic regression, support vector machine, and random forest was compared before and after resampling. The predictive performance of the models was evaluated by G-means and area under the curve (AUC). Results On the imbalanced CHARLS diabetes dataset, the G-means of logistic regression, support vector machine, and random forest were 0.2227, 0, and 0, and the AUC values were 0.7612, 0.7363, and 0.7429, respectively. The logistic regression model was significantly better than the support vector machine in terms of accuracy (χ²=1 231.501, P<0.001) and AUC (Z=2.634, P=0.028). All four resampling techniques improved the G-means of the models, especially SMOTE and ADASYN. In addition, random undersampling could not significantly improve the AUC of logistic regression (Z=3.027, P=0.003), support vector machine (Z=0.301, P=0.764), or random forest (Z=0.446, P=0.656). Random oversampling, SMOTE, and ADASYN improved the AUC of the classification models to varying degrees. Conclusion SMOTE and ADASYN can better handle the problem of imbalanced diabetes data and improve the predictive performance of diabetes classifiers.

Keywords:	Imbalanced classification; Resampling; Diabetes; Middle-aged and elderly residents
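
The abstract reports G-means of 0 for the support vector machine and random forest on the unresampled data. The sketch below illustrates why this can happen and what the simplest of the four techniques does. It assumes the usual definition of the G-mean as the square root of sensitivity times specificity, which the abstract does not spell out, and it implements only random oversampling, not SMOTE or ADASYN. The function names `g_mean` and `random_oversample` and the toy labels are hypothetical, chosen for illustration. They are not the authors' code and not CHARLS data.

```python
import math
import random
from collections import Counter


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.

    Assumed definition: sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy labels (made up): 9 negatives and 1 positive.
y_true = [0] * 9 + [1]

# A classifier that never predicts the minority class.
y_all_negative = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_all_negative)) / len(y_true)
print("accuracy:", accuracy)
print("G-mean:", g_mean(y_true, y_all_negative))

# Balance the classes by duplicating minority rows.
X = [[i] for i in range(10)]
X_bal, y_bal = random_oversample(X, y_true)
print("class counts after oversampling:", Counter(y_bal))
```

A classifier that assigns every case to the majority class keeps a high accuracy but has zero sensitivity, so its G-mean is 0. Resampling is applied to the training data so that the fitted model is less likely to collapse to this degenerate prediction.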
