A data stream classification algorithm based on adaptive random forest ensemble model

Computer Engineering & Science

ZHANG Xin-yu, AN Jian-cheng, CAO Rui

(School of Software, Taiyuan University of Technology, Jinzhong 030600, China)

Received: 2019-08-04 Revised: 2019-11-01 Online: 2020-03-25 Published: 2020-03-25
Funding:
National Natural Science Foundation of China (61741212)

Abstract:

The adaptive random forest classifier attaches a warning detector and a drift detector to each base classifier. During training, multiple warning detectors are often triggered at the same time, so multiple background trees are trained simultaneously, which increases memory consumption and running time. To address this problem, this paper proposes an improved adaptive random forest ensemble classification algorithm. It places the concept drift detector at the ensemble level, removes the drift detectors from the individual base trees, and determines the number of background trees to train according to the prediction accuracy of the ensemble. When the improved algorithm is used to classify relatively balanced data streams, it reduces running time and memory consumption compared with the original algorithm while maintaining classification performance, and it adapts more quickly to concept drift in the data stream.

Key words: data stream, concept drift, random forest, drift detector, ensemble classifier
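
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the detector or the rule for choosing the number of background trees, so everything below is an assumption made for illustration. The class names `MajorityClassLearner` and `EnsembleLevelDriftForest` are hypothetical, a majority-class learner stands in for the Hoeffding trees of a real adaptive random forest, two fixed accuracy thresholds over a sliding window stand in for the warning and drift detector, and the number of background trees is assumed to grow linearly as ensemble accuracy falls.

```python
import math
from collections import Counter, deque


class MajorityClassLearner:
    """Stand-in for a Hoeffding tree: predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


class EnsembleLevelDriftForest:
    """Forest with one accuracy monitor shared by the whole ensemble."""

    def __init__(self, n_trees=10, window=50, warn_acc=0.8, drift_acc=0.6):
        self.trees = [MajorityClassLearner() for _ in range(n_trees)]
        self.background = []                 # background trees being trained
        self.recent = deque(maxlen=window)   # 1 = ensemble correct, 0 = wrong
        self.warn_acc = warn_acc             # assumed warning threshold
        self.drift_acc = drift_acc           # assumed drift threshold

    def predict(self, x):
        votes = Counter(t.predict(x) for t in self.trees)
        votes.pop(None, None)                # ignore untrained trees
        return votes.most_common(1)[0][0] if votes else None

    def accuracy(self):
        return sum(self.recent) / len(self.recent) if self.recent else 1.0

    def n_background_needed(self):
        # Assumed rule: the lower the ensemble accuracy, the more trees to train.
        n = len(self.trees)
        return min(n, math.ceil((1.0 - self.accuracy()) * n))

    def learn(self, x, y):
        # Test-then-train: record whether the ensemble was right, then update.
        self.recent.append(1 if self.predict(x) == y else 0)
        for tree in self.trees + self.background:
            tree.learn(x, y)
        if len(self.recent) < self.recent.maxlen:
            return "stable"                  # window not full yet
        acc = self.accuracy()
        if acc < self.drift_acc and self.background:
            # Drift: background trees replace the same number of old trees.
            k = len(self.background)
            self.trees = self.trees[k:] + self.background
            self.background = []
            self.recent.clear()
            return "drift"
        if acc < self.warn_acc and not self.background:
            # Warning: start only as many background trees as accuracy suggests.
            k = self.n_background_needed()
            self.background = [MajorityClassLearner() for _ in range(k)]
            return "warning"
        return "stable"


forest = EnsembleLevelDriftForest(n_trees=5, window=10)
stream = [0] * 100 + [1] * 40                # abrupt concept change at index 100
events = [forest.learn(None, y) for y in stream]
print([e for e in events if e != "stable"])
```

Because the monitor sits on the ensemble rather than on each tree, a single warning starts a bounded number of background trees instead of one per triggered base learner, which is the source of the memory and time savings the abstract describes.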

ZHANG Xin-yu, AN Jian-cheng, CAO Rui. A data stream classification algorithm based on adaptive random forest ensemble model[J]. Computer Engineering & Science.
