An improved SDRSMOTE algorithm based on Euclidean distance

Computer Engineering & Science

LI Ke-wen1, LIN Ya-lin1, YANG Yao-zhong2

(1. College of Computer & Communication Engineering, China University of Petroleum (East China), Qingdao 266580, China;

2. Control Center of Informatization, Sinopec Shengli Oil Field, Dongying 257022, China)

Received: 2018-12-21 Revised: 2019-04-23 Online: 2019-11-25 Published: 2019-11-25

Abstract:

The SMOTE algorithm can expand the minority class and improve the classification of minority samples in an unbalanced data set. However, when it synthesizes minority samples, it chooses boundary samples and draws its random values blindly. To address this problem, this paper improves the traditional SMOTE oversampling algorithm; the improved algorithm is called SDRSMOTE. It takes into account the distribution of all samples in the unbalanced data set and combines a support degree sd with an influencing factor posFac to guide the synthesis of minority samples. On the WEKA platform, the SMOTE and SDRSMOTE algorithms are each used to oversample six selected unbalanced data sets as a preprocessing step, and decision tree, AdaBoost, Bagging, and Naive Bayes classifiers are then used to make predictions on the preprocessed data sets. F-value, G-mean, and AUC are selected as the evaluation metrics for classification performance. The experiments show that the unbalanced data sets preprocessed by SDRSMOTE yield better classification results, which demonstrates the effectiveness of the algorithm.

Key words: unbalanced data set, classification, boundary sample, support degree, influencing factor, Euclidean distance, SMOTE
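
For orientation, the baseline that the abstract criticizes can be sketched in a few lines. This is a minimal illustration of classic SMOTE interpolation only, not the paper's SDRSMOTE: the abstract does not define the support degree sd or the influencing factor posFac, so neither is implemented here. The function names (`euclidean`, `smote`, `f_value_and_g_mean`), the parameters `k` and `seed`, the random choice of the base sample, and the use of beta = 1 for the F-value are all assumptions made for this example. The metric helper shows how two of the three evaluation measures are typically computed from confusion-matrix counts, with the minority class as the positive class.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=5, seed=0):
    """Baseline SMOTE sketch: interpolate between a minority sample and one
    of its k nearest minority neighbours. Needs at least two samples."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Simplification: the base sample is drawn at random.
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: euclidean(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Uniform value in [0, 1): the "blind" random number of plain SMOTE.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def f_value_and_g_mean(tp, fn, fp, tn):
    """F-value (assuming beta = 1) and G-mean from binary confusion-matrix
    counts. Assumes no denominator is zero."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_value = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return f_value, g_mean


# Hypothetical toy data: four minority samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=3, k=2)
f_value, g_mean = f_value_and_g_mean(tp=8, fn=2, fp=4, tn=86)
```

Every synthetic point lies on the segment between two existing minority samples, and neither the choice of neighbour nor the value of `gap` looks at where the majority samples are. That is the blindness the abstract sets out to reduce.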

LI Ke-wen, LIN Ya-lin, YANG Yao-zhong. An improved SDRSMOTE algorithm based on Euclidean distance[J]. Computer Engineering & Science.
