• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining

一种改进的基于欧氏距离的SDRSMOTE算法

李克文1,林亚林1,杨耀忠2   

  1. (1.中国石油大学(华东)计算机与通信工程学院,山东 青岛 266580;
    2.中国石化胜利油田分公司信息化管理中心,山东 东营 257022)
  • Received: 2018-12-21 Revised: 2019-04-23 Published: 2019-11-25 Released: 2019-11-25

An improved SDRSMOTE algorithm based on Euclidean distance

LI Ke-wen1,LIN Ya-lin1,YANG Yao-zhong2   

  1. (1. College of Computer & Communication Engineering, China University of Petroleum, Qingdao 266580;
    2. Control Center of Informatization Sinopec Shengli Oil Field, Dongying 257022, China)

     
  • Received: 2018-12-21 Revised: 2019-04-23 Online: 2019-11-25 Published: 2019-11-25

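Summary:

The SMOTE algorithm can augment the minority-class samples and improve classification of the minority class in imbalanced data sets, but when it synthesizes minority samples it is blind in both its selection of boundary samples and its choice of random-number values. To address this problem, the traditional SMOTE oversampling algorithm is improved; the improved oversampling algorithm is defined as SDRSMOTE. It takes into account the distribution of all samples in the imbalanced data set and guides the synthesis of minority-class samples by combining the support degree sd with the influencing factor posFac. On the WEKA platform, SMOTE and SDRSMOTE are each used to oversample six selected imbalanced data sets as a preprocessing step. Decision tree, AdaBoost, Bagging, and naive Bayes classifiers are then used for prediction on the preprocessed data sets, with F-value, G-mean, and AUC as the evaluation metrics for classification performance. The experiments show that imbalanced data sets preprocessed by SDRSMOTE yield better classification results, demonstrating the effectiveness of the algorithm.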

Keywords: imbalanced data set, classification, boundary sample, support degree, influencing factor, Euclidean distance, SMOTE

Abstract:

The SMOTE algorithm can expand the minority samples and improve classification of the minority class in an unbalanced data set. However, it is blind in its choice of boundary samples and of random-number values when it expands the minority samples. This paper improves the traditional SMOTE oversampling algorithm; the improved algorithm is called SDRSMOTE. It takes into account the distribution of all samples in the unbalanced data set and introduces the support degree sd and the influencing factor posFac to guide the synthesis of minority samples. On the WEKA platform, the SMOTE and SDRSMOTE algorithms are each used to oversample six selected unbalanced data sets, and decision tree, AdaBoost, Bagging, and naive Bayes classifiers are then used to classify the preprocessed data sets. F-value, G-mean, and AUC are selected as evaluation indexes. The experiments show that unbalanced data sets preprocessed by SDRSMOTE are classified better, which demonstrates the effectiveness of the algorithm.
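For orientation, the baseline that the abstract criticizes can be sketched as follows. This is a minimal sketch of classic SMOTE only, not of SDRSMOTE: the abstract does not define sd or posFac, so they are not modeled here. The function names `euclidean` and `smote` and the parameters `n_new`, `k`, and `rng` are illustrative assumptions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=5, rng=None):
    """Classic SMOTE sketch (hypothetical helper, not the paper's code).

    Each synthetic sample is a random interpolation between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if rng is None:
        rng = random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding the base.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: euclidean(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        # Unguided uniform random number in [0, 1): the step the abstract
        # describes as blind, since it ignores the surrounding distribution.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Both the base sample and the interpolation factor are drawn uniformly, with no reference to majority-class samples. According to the abstract, SDRSMOTE replaces this unguided choice with one informed by the distribution of all samples.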

Key words: unbalanced data set, classification, boundary sample, support degree, influencing factor, Euclidean distance, SMOTE
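
The three evaluation indexes named in the abstract are not defined there. The sketch below uses their common textbook forms as an assumption: F-value as the harmonic mean of precision and recall (beta = 1), G-mean as the geometric mean of minority-class recall and majority-class specificity, and AUC as the fraction of (minority, majority) pairs ranked correctly. The function names are hypothetical, and the paper may use a different beta or averaging scheme.

```python
import math


def f_value_and_g_mean(y_true, y_pred, positive=1):
    """Return (F-value with beta = 1, G-mean) for a binary problem.

    `positive` marks the minority class; every other label is treated
    as the majority class. Assumed definitions, not taken from the paper.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_value, g_mean


def auc(y_true, scores, positive=1):
    """AUC as the share of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

Unlike plain accuracy, all three measures depend on how well the minority class is recognized, which is presumably why they are chosen for unbalanced data sets.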