• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (05): 788-799.

• High Performance Computing •

A Boosting classification algorithm for imbalanced drift data stream based on Hellinger distance

ZHANG Xi-long (张喜龙), HAN Meng (韩萌), CHEN Zhi-qiang (陈志强), WU Hong-xin (武红鑫), LI Mu-hang (李慕航)

  1. (School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China)
  • Received: 2021-11-09 Revised: 2022-01-15 Accepted: 2022-05-25 Online: 2022-05-25 Published: 2022-05-24
  • Funding:
    National Natural Science Foundation of China (62062004); Natural Science Foundation of Ningxia (2020AAC03216, 2022AAC03279); Graduate Innovation Project of North Minzu University (YCX21085)

Abstract: Class imbalance in a data stream can seriously degrade classification performance, and concept drift is itself a difficult problem in stream data mining. To improve classification when both occur, a new Boosting Classification Algorithm for imbalanced drifting data streams based on Hellinger Distance (BCA-HD) is proposed. In a novel step, the algorithm combines instance-level and classifier-level weights to update the classifiers dynamically and so adapt to concept drift. At the bottom layer it uses the ensemble algorithm SMOTEBoost as the base classifier, which applies resampling internally to handle the class imbalance. The proposed algorithm is compared with 9 other algorithms on 16 datasets containing abrupt and gradual drift. The results show that it ranks first in both average value and average rank for G-mean and AUC. The algorithm can therefore better adapt to the simultaneous occurrence of concept drift and class imbalance, which helps improve classification performance.

Key words: data stream, imbalanced data, concept drift, Boosting, Hellinger distance
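
For readers unfamiliar with two of the quantities named above, the following is a minimal Python sketch of the standard textbook definitions of the Hellinger distance between two discrete distributions and of the G-mean computed from confusion-matrix counts. It is not the BCA-HD implementation: the abstract does not state how the distance enters the weighting, and the function names, argument layout, and example numbers here are assumptions made for illustration only.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    p and q are sequences of non-negative probabilities over the same
    bins, each summing to 1. The result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity for a binary task.

    A class with no instances contributes a rate of 0.0 (an assumption
    made here to avoid division by zero).
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    # Hypothetical class proportions in an older and a newer window.
    old_window = [0.9, 0.1]
    new_window = [0.6, 0.4]
    print("Hellinger distance:", hellinger_distance(old_window, new_window))

    # Hypothetical confusion counts for an imbalanced binary problem.
    print("G-mean:", g_mean(tp=8, fn=2, tn=90, fp=10))
```

The G-mean is zero whenever one class is entirely misclassified, so a classifier that ignores the minority class scores poorly however high its overall accuracy, which is why it is a common metric for imbalanced data.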