• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (10): 128-134.

• Article

A Novel Classification Algorithm for Imbalanced Datasets Based on a Hybrid Resampling Strategy

GU Qiong, YUAN Lei, NING Bin, WU Zhao, HUA Li, LI Wenxin

  1. (School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang 441053, China)
  • Received: 2012-04-25 Revised: 2012-07-10 Online: 2012-10-25 Published: 2012-10-25
  • Supported by:

    National Natural Science Foundation of China (61075063, 61172084); Natural Science Foundation of Hubei Province (2010CDB05201)

Abstract:

Class imbalance is a common problem in classification: it arises when one class has far more examples than another. Because it occurs in many real-world applications, it has attracted growing attention from researchers, and learning classifiers from datasets with imbalanced class distributions has become a challenging problem in the data mining and pattern recognition communities. In this paper, we present a novel preprocessing approach that combines unsupervised clustering with supervised learning to handle imbalanced datasets, and we apply it to training with SMO. The proposed algorithm first reduces the imbalance ratio by constructing new minority-class samples with an improved synthetic minority oversampling technique (SMOTE), so that the classes are roughly balanced. Then, guided by the characteristics of the SMO algorithm, it clusters both classes and deletes redundant or noisy samples. After oversampling and cleaning, the useful samples are retained while the dataset shrinks, which improves the efficiency of support vector machine training. Experimental results show that the proposed approach effectively improves the classification accuracy of the minority class while maintaining overall classification performance.

Key words: classification; imbalanced dataset; preprocessing; hybrid resampling; SMOTE; clustering
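
The two-stage preprocessing summarized in the abstract (oversample the minority class, then clean both classes by clustering) can be illustrated with a minimal Python sketch. The abstract does not specify how SMOTE was improved or which clustering rule decides what gets deleted, so everything here is an assumption: the sketch uses plain SMOTE, a basic k-means, and a hypothetical `keep_ratio` rule that keeps the samples closest to each cluster centroid. The function names and parameters are illustrative and are not taken from the paper.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """Plain SMOTE (not the paper's improved variant): interpolate between
    a minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def kmeans(points, n_clusters, n_iter=20, rng=None):
    """Basic k-means; returns centroids and the member list of each cluster."""
    if rng is None:
        rng = random.Random(0)
    centroids = rng.sample(points, n_clusters)
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(n_clusters),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid = mean of members; an empty cluster keeps its old centroid
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


def cluster_clean(points, n_clusters, keep_ratio=0.8, rng=None):
    """Hypothetical cleaning rule (an assumption, not the paper's): in each
    cluster, keep the `keep_ratio` fraction of samples nearest the centroid."""
    centroids, clusters = kmeans(points, n_clusters, rng=rng)
    kept = []
    for centre, members in zip(centroids, clusters):
        members = sorted(members, key=lambda p: math.dist(p, centre))
        kept.extend(members[: math.ceil(keep_ratio * len(members))])
    return kept


def hybrid_resample(majority, minority, n_clusters=2, keep_ratio=0.8, seed=0):
    """Oversample the minority class, then clean each class by clustering."""
    rng = random.Random(seed)
    # Step 1: add synthetic minority samples until the classes are the same size
    n_new = max(0, len(majority) - len(minority))
    balanced_minority = list(minority) + smote(minority, n_new, rng=rng)
    # Step 2: cluster each class separately and drop its outlying samples
    return (
        cluster_clean(list(majority), n_clusters, keep_ratio, rng),
        cluster_clean(balanced_minority, n_clusters, keep_ratio, rng),
    )


if __name__ == "__main__":
    data_rng = random.Random(1)
    majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(20)]
    minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(4)]
    maj_kept, min_kept = hybrid_resample(majority, minority)
    print(len(maj_kept), len(min_kept))
```

The two returned lists would then be labeled and passed to an SVM trainer such as an SMO implementation. That step is omitted here because the abstract gives no details of the classifier configuration.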