A Boosting classification algorithm for imbalanced drift data stream based on Hellinger distance

Computer Engineering & Science, 2022, Vol. 44, Issue (05): 788-799.

ZHANG Xi-long, HAN Meng, CHEN Zhi-qiang, WU Hong-xin, LI Mu-hang

(School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China)

Received: 2021-11-09 Revised: 2022-01-15 Accepted: 2022-05-25 Online: 2022-05-25 Published: 2022-05-24

Funding: National Natural Science Foundation of China (62062004); Natural Science Foundation of Ningxia (2020AAC03216, 2022AAC03279); Graduate Innovation Project of North Minzu University (YCX21085)

Abstract

Abstract: Class imbalance in a data stream can severely degrade classification performance, and concept drift, a difficult problem in stream data mining, makes the task harder still. To improve classification performance when both occur, a new Boosting Classification Algorithm for imbalanced drifting data streams based on Hellinger Distance (BCA-HD) is proposed. The algorithm combines instance-level and classifier-level weights to update the classifiers dynamically and so adapt to concept drift. At the bottom layer, the ensemble algorithm SMOTEBoost serves as the base classifier and uses resampling internally to handle the class imbalance. The proposed algorithm is compared with 9 other algorithms on 16 datasets with abrupt and gradual drift. The results show that it ranks first in both the average value and the average rank of G-mean and AUC. The algorithm therefore adapts better to the simultaneous occurrence of concept drift and class imbalance, which helps improve classification performance.

Key words: data stream, imbalanced data, concept drift, Boosting, Hellinger distance
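
The abstract names the Hellinger distance and the G-mean metric without giving their formulas. The following is a minimal sketch of the standard textbook definitions of both quantities, not necessarily the exact form used inside BCA-HD. The function names `hellinger_distance` and `g_mean`, the use of pre-binned discrete distributions, and the convention that label 1 is the minority (positive) class are all assumptions made for illustration.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions on the same bins.

    Standard definition: H(P, Q) = sqrt(0.5 * sum((sqrt(p_i) - sqrt(q_i))^2)).
    The result lies in [0, 1]; 0 means identical, 1 means disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    squared_diff = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_diff / 2.0)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the true positive rate and the true negative rate.

    Assumes binary labels, with `positive` marking the minority class.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)
```

A classifier that always predicts the majority class gets a G-mean of 0 however high its accuracy is, which is why G-mean is a common choice for evaluating classifiers on imbalanced streams.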

ZHANG Xi-long, HAN Meng, CHEN Zhi-qiang, WU Hong-xin, LI Mu-hang. A Boosting classification algorithm for imbalanced drift data stream based on Hellinger distance[J]. Computer Engineering & Science, 2022, 44(05): 788-799.
