一种基于重取样的代价敏感学习算法

Computer Engineering & Science, 2011, Vol. 33, Issue 9: 130-135.

A Novel Cost Sensitive Learning Algorithm Based on Resampling

GU Qiong (谷琼), YUAN Lei (袁磊), NING Bin (宁彬), XIONG Qijun (熊启军), HUA Li (华丽), LI Wenxin (李文新)

(School of Mathematics and Computer Science, Xiangfan University, Xiangyang 441053, Hubei, China)

Received: 2011-05-20 Revised: 2011-07-26 Online: 2011-09-25 Published: 2011-09-25

About the author: GU Qiong (born 1973), female, from Jingmen, Hubei; Ph.D., lecturer. Her research interests are data mining, evolutionary computation, and machine learning.

Abstract

Most studies of imbalanced data set classification focus on either resampling or cost-sensitive learning alone, and neglect the fact that imbalanced class distribution and unequal misclassification costs usually occur together. We propose a novel cost-sensitive learning (CSL) algorithm based on hybrid resampling, whose objective is the minimal misclassification cost rather than the maximal accuracy. The algorithm combines the two kinds of solution. It first reconstructs the sample space of both the majority and the minority class so that the two classes of the original data set become roughly balanced. It then applies a cost-sensitive learning procedure for classification, in which the cost of misclassifying a minority-class sample is much higher than the cost of misclassifying a majority-class sample. Experimental results show that the proposed method improves the classification accuracy on the minority class while effectively decreasing the total misclassification cost, and that it outperforms traditional algorithms on imbalanced-class problems.

Key words: classification; imbalanced data set; hybrid resampling; cost-sensitive learning
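
The two-stage idea summarized in the abstract (balance the classes first, then classify by minimal misclassification cost instead of maximal accuracy) might be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the resampling scheme or the base classifier, so plain random undersampling and oversampling stand in for the hybrid resampling step, and a minimum-expected-cost decision rule stands in for the cost-sensitive learner. The function names `hybrid_resample`, `min_cost_label`, and `total_cost`, the label convention (1 = minority, 0 = majority), and the cost values are all hypothetical.

```python
import random


def hybrid_resample(samples, labels, minority=1, seed=0):
    """Balance a binary data set by random under- and oversampling.

    Assumption: plain random resampling stands in for the paper's
    hybrid resampling step, whose details the abstract does not give.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx or not majority_idx:
        raise ValueError("both classes must be present")
    # Move both classes toward the mean class size.
    target = (len(minority_idx) + len(majority_idx)) // 2
    # Undersample the majority class without replacement.
    kept_majority = rng.sample(majority_idx, min(target, len(majority_idx)))
    # Oversample the minority class with replacement.
    n_extra = max(0, target - len(minority_idx))
    extra_minority = [rng.choice(minority_idx) for _ in range(n_extra)]
    chosen = kept_majority + minority_idx + extra_minority
    rng.shuffle(chosen)
    return [samples[i] for i in chosen], [labels[i] for i in chosen]


def min_cost_label(p_minority, cost_fn, cost_fp):
    """Pick the label with the lower expected misclassification cost.

    p_minority: estimated probability that the sample is minority (1).
    cost_fn: cost of predicting 0 when the true class is 1.
    cost_fp: cost of predicting 1 when the true class is 0.
    """
    expected_cost_predict_0 = p_minority * cost_fn
    expected_cost_predict_1 = (1.0 - p_minority) * cost_fp
    return 1 if expected_cost_predict_1 <= expected_cost_predict_0 else 0


def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """Sum the misclassification costs over all predictions."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += cost_fn
        elif truth == 0 and pred == 1:
            cost += cost_fp
    return cost


# Illustrative costs only: a missed minority sample costs 10x a false alarm.
print(min_cost_label(0.2, cost_fn=10.0, cost_fp=1.0))  # 1
print(min_cost_label(0.2, cost_fn=1.0, cost_fp=1.0))   # 0
```

With equal costs the rule reduces to the usual accuracy-oriented threshold of 0.5, whereas an unequal cost ratio shifts the decision toward the minority class, which is the behaviour the abstract attributes to cost-sensitive classification.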

GU Qiong, YUAN Lei, NING Bin, XIONG Qijun, HUA Li, LI Wenxin. A Novel Cost Sensitive Learning Algorithm Based on Resampling[J]. Computer Engineering & Science, 2011, 33(9): 130-135.
