• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (05): 917-925.

• Artificial Intelligence and Data Mining •

A multi-granularity ensemble classification algorithm for imbalanced data

CHEN Li-fang, DAI Qi, ZHAO Jia-liang

  1. (College of Science, North China University of Science and Technology, Tangshan 063210, China)

  • Received: 2020-03-07 Revised: 2020-05-13 Accepted: 2021-05-25 Online: 2021-05-25 Published: 2021-05-19
  • Supported by:
    Natural Science Foundation of Hebei Province (F2014209086)

Abstract: Traditional models suffer from low accuracy, poor stability, and weak generalization when classifying imbalanced data. To address these problems, a multi-granularity ensemble classification algorithm based on sequential three-way decisions, MGE-S3WD, is proposed. A binary relation is used to partition the granular layers dynamically. Thresholds are calculated from the cost matrix and a multi-level granular structure is constructed, so that the data on each granular layer are divided into a positive region, a boundary region, and a negative region. The partitions on each granular layer are then recombined pairwise (positive with negative, positive with boundary, and negative with boundary) to form new data subsets, and a base classifier is built on each subset to achieve ensemble classification of imbalanced data. Simulation results show that the algorithm effectively reduces the imbalance ratio of the data subsets and increases the diversity of the base classifiers in ensemble learning. Under the two evaluation metrics G-mean and F-measure1, its classification performance is better than, or partially better than, that of other ensemble classification algorithms. The algorithm effectively improves the classification accuracy and stability of the classification model and offers a new research direction for ensemble learning on imbalanced data sets.
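The abstract does not give the formulas, so the following is only a minimal sketch of one granular layer, not the authors' implementation. It assumes the standard decision-theoretic thresholds for three-way decisions, a hypothetical cost matrix `cost[action][state]`, and a hypothetical list `probs` holding each sample's estimated probability of belonging to the target class. The function names (`thresholds`, `three_way_partition`, `recombine`, `g_mean_and_f1`) are illustrative, and F-measure is taken here to be the F1 score, which is also an assumption.

```python
import math


def thresholds(cost):
    """Return (alpha, beta) from a cost matrix cost[action][state].

    Actions: "P" accept, "B" defer, "N" reject. States: "P" in class, "N" not.
    Uses the usual decision-theoretic formulas (assumed, not from the abstract).
    """
    a_num = cost["P"]["N"] - cost["B"]["N"]
    alpha = a_num / (a_num + cost["B"]["P"] - cost["P"]["P"])
    b_num = cost["B"]["N"] - cost["N"]["N"]
    beta = b_num / (b_num + cost["N"]["P"] - cost["B"]["P"])
    return alpha, beta


def three_way_partition(probs, alpha, beta):
    """Split sample indices into positive, boundary, and negative parts."""
    parts = {"POS": [], "BND": [], "NEG": []}
    for i, p in enumerate(probs):
        if p >= alpha:
            parts["POS"].append(i)
        elif p <= beta:
            parts["NEG"].append(i)
        else:
            parts["BND"].append(i)
    return parts


def recombine(parts):
    """Form the three pairwise subsets on which base classifiers are trained."""
    pairs = (("POS", "NEG"), ("POS", "BND"), ("NEG", "BND"))
    return [sorted(parts[a] + parts[b]) for a, b in pairs]


def g_mean_and_f1(y_true, y_pred):
    """G-mean and F1 for binary labels, with 1 as the minority (positive) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return math.sqrt(recall * specificity), f1


# Hypothetical cost matrix and probabilities, chosen only for illustration.
cost = {"P": {"P": 0, "N": 4}, "B": {"P": 2, "N": 1}, "N": {"P": 6, "N": 0}}
alpha, beta = thresholds(cost)
parts = three_way_partition([0.9, 0.7, 0.5, 0.3, 0.1, 0.05], alpha, beta)
subsets = recombine(parts)
```

In a full multi-granularity setting the partition step would be repeated on every granular layer, and each recombined subset would feed one base classifier. The sketch stops at producing the index subsets and the two evaluation metrics.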




Key words: sequential three-way decision; multi-granularity; cost-sensitive; imbalanced data; ensemble learning