• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2011, Vol. 33 ›› Issue (9): 130-135.

• Papers •

A Novel Cost Sensitive Learning Algorithm Based on Resampling

GU Qiong, YUAN Lei, NING Bin, XIONG Qijun, HUA Li, LI Wenxin

  1. (School of Mathematics and Computer Science, Xiangfan University, Xiangyang, Hubei 441053, China)
  • Received: 2011-05-20 Revised: 2011-07-26 Online: 2011-09-25 Published: 2011-09-25
  • About the author: GU Qiong (b. 1973), female, born in Jingmen, Hubei; Ph.D., lecturer; research interests: data mining, evolutionary computation, and machine learning.

Abstract:

Most studies on imbalanced data set classification focus on either resampling (reconstructing the data set) or cost-sensitive learning alone, and neglect the fact that imbalanced class distributions and unequal misclassification costs often occur together. We propose a cost-sensitive learning (CSL) algorithm based on hybrid resampling that combines these two kinds of solutions, with minimal misclassification cost as its objective. First, resampling reconstructs both the majority and the minority class so that the two classes of the original data set become roughly balanced. Then a cost-sensitive learning procedure performs the classification by minimizing the misclassification cost rather than maximizing accuracy, where the cost of misclassifying a minority-class sample is much higher than that of misclassifying a majority-class sample. The experimental results show that the proposed method improves classification accuracy on the minority class and effectively reduces the total misclassification cost, and that it outperforms traditional algorithms on imbalanced classification problems.

Key words: classification; imbalanced data set; hybrid resampling; cost-sensitive learning
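
The two-stage idea in the abstract (balance the classes first, then classify by minimal misclassification cost) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not specify the resampling scheme or the base learner, so the sketch uses plain random oversampling and undersampling, and the standard minimum-expected-cost decision rule on a minority-class probability supplied by some classifier. The names `hybrid_resample`, `min_cost_label`, `total_cost`, `cost_fn`, and `cost_fp` are hypothetical and not taken from the paper.

```python
import random
from collections import Counter


def hybrid_resample(X, y, minority_label=1, seed=0):
    """Roughly balance a two-class data set.

    Assumption: random oversampling (with replacement) of the minority class
    plus random undersampling of the majority class, both to the mean of the
    two original class sizes. The paper's actual scheme may differ.
    """
    rng = random.Random(seed)
    minority = [(x, l) for x, l in zip(X, y) if l == minority_label]
    majority = [(x, l) for x, l in zip(X, y) if l != minority_label]
    target = (len(minority) + len(majority)) // 2
    # Duplicate random minority samples until the class reaches the target size.
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    # Keep a random subset of the majority class of the same size.
    kept = rng.sample(majority, target)
    data = minority + extra + kept
    rng.shuffle(data)
    return [x for x, _ in data], [l for _, l in data]


def min_cost_label(p_minority, cost_fn, cost_fp):
    """Return 1 (minority) or 0 (majority), whichever has lower expected cost.

    cost_fn: cost of predicting majority for a true minority sample.
    cost_fp: cost of predicting minority for a true majority sample.
    """
    expected_cost_if_majority = p_minority * cost_fn
    expected_cost_if_minority = (1.0 - p_minority) * cost_fp
    return 1 if expected_cost_if_minority < expected_cost_if_majority else 0


def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Sum misclassification costs over predictions (1 = minority class)."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += cost_fn
        elif t == 0 and p == 1:
            cost += cost_fp
    return cost


# Toy data: 10 minority samples (label 1) and 90 majority samples (label 0).
X = list(range(100))
y = [1] * 10 + [0] * 90
Xr, yr = hybrid_resample(X, y)
print(Counter(yr))  # class counts after resampling

# With a 10:1 cost ratio, a minority probability of only 0.2 already
# favors predicting the minority class.
print(min_cost_label(0.2, cost_fn=10.0, cost_fp=1.0))
```

Under this rule a sample is assigned to the minority class whenever its estimated minority probability exceeds `cost_fp / (cost_fp + cost_fn)`, which is why a high minority misclassification cost lowers the decision threshold.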