An improvement of random forests algorithm based on comprehensive sampling without replacement

J4 ›› 2015, Vol. 37 ›› Issue (07): 1233-1238.

Funding: Sichuan Provincial Science and Technology Support Program (2015GZ0102)

LI Hui, LI Zheng, SHE Kun

(School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China)

Received: 2014-08-05  Revised: 2015-04-03  Online: 2015-07-25  Published: 2015-07-25

Abstract

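Abstract (Chinese version):

Data mining is an important method in big-data service computing and matters greatly for optimizing it. As a typical data mining method, random forest achieves high accuracy and is therefore widely used. To handle big-data problems in service computing more accurately and efficiently, further improving the accuracy and efficiency of random forest has become an important research topic. By changing both the sample size of the training set and the sampling method, and analyzing balanced and unbalanced sample sets, we find that with these two improvements the generalization error on balanced sample sets decreases by 12%~20% within the optimization interval. Changing the sampling method alone shortens the running time of the algorithm, improving efficiency by 10%~40%. Efficiency also improves markedly on unbalanced data. Both theory and experiments show that the improved random forest algorithm based on comprehensive sampling without replacement raises accuracy on balanced samples, making this data mining method better suited to big data analysis and processing in service computing.

Keywords: random forest, balanced data, unbalanced data, sampling without replacement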

Abstract:

Data mining is an important method in big data and service computing. As a typical data mining method, random forest is widely used because of its low error rate. To deal with big data more accurately and efficiently, we further improve the accuracy and efficiency of random forest. We show both theoretically and experimentally that our method decreases the generalization error by about 12%~20% when the number we choose for replacement exceeds the number of samples. Moreover, we replace repeated sampling with a simple method that proves equivalent to it. This shortens the time needed to build the forest, improving efficiency by about 10%~40% when used alone, which just makes up for the efficiency loss of the first improvement. Combining the two methods, we improve efficiency on unbalanced data by 10% and improve accuracy on balanced data by over 12% without any impact on efficiency. Therefore, the proposed method is more suitable than the original for big data analysis and processing in service computing.

Key words: random forest; balanced data; unbalanced data; sampling without replacement
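The abstract contrasts repeated sampling (sampling with replacement) with sampling without replacement, but it does not spell out the exact procedure. The following is only a minimal sketch of the two sampling styles for one tree's training set, not the authors' algorithm. The function names and the subsample fraction 0.632 are assumptions chosen for illustration. The fraction reflects the well-known fact that a bootstrap sample of size n contains about 1 - 1/e ≈ 63.2% distinct items in expectation.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Classic bagging: draw n_samples indices WITH replacement."""
    return rng.choices(range(n_samples), k=n_samples)


def subsample_indices(n_samples, fraction, rng):
    """Draw round(fraction * n_samples) distinct indices WITHOUT replacement.

    `fraction` is a hypothetical tuning parameter, not a value from the paper.
    """
    k = max(1, round(fraction * n_samples))
    return rng.sample(range(n_samples), k)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 1000

boot = bootstrap_indices(n, rng)       # duplicates are expected here
sub = subsample_indices(n, 0.632, rng)  # every index appears at most once

print("bootstrap: drawn =", len(boot), "distinct =", len(set(boot)))
print("subsample: drawn =", len(sub), "distinct =", len(set(sub)))
```

In this sketch the without-replacement draw yields a smaller index list with no duplicates, so each tree would be trained on fewer rows while still seeing roughly as many distinct samples as a bootstrap draw. Whether this matches the equivalence claimed in the abstract would need to be checked against the full paper.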

LI Hui, LI Zheng, SHE Kun. An improvement of random forests algorithm based on comprehensive sampling without replacement [J]. J4, 2015, 37(07): 1233-1238.
