• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (07): 1233-1238.

• Articles •

An Improved Random Forest Algorithm Based on Comprehensive Sampling Without Replacement

李慧,李正,佘堃   

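  1. (School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731)
  • Received: 2014-08-05 Revised: 2015-04-03 Online: 2015-07-25 Published: 2015-07-25
  • Supported by:

    Sichuan Province Science and Technology Support Program (2015GZ0102)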

An improvement of random forests algorithm based on
comprehensive sampling without replacement  

LI Hui, LI Zheng, SHE Kun

  1. (School of Computer Science and Engineering, University
    of Electronic Science and Technology of China, Chengdu 611731, China)
  • Received: 2014-08-05 Revised: 2015-04-03 Online: 2015-07-25 Published: 2015-07-25

Abstract (Chinese version):

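Data mining is an important method in big-data service computing and matters greatly for optimizing service computing. As a typical data mining method, random forest achieves high accuracy and is therefore widely used. To handle big-data problems in service computing more accurately and efficiently, further improving the accuracy and efficiency of random forest has become an extremely important line of research. By changing the sample size of the training set and the sampling method, and by analyzing both balanced and unbalanced sample sets, we find that with these two improvements, within the optimized range, the generalization error on balanced sample sets decreases by 12%~20%. Changing the sampling method alone shortens the running time of the algorithm, improving efficiency by 10%~40%. For unbalanced data, efficiency also improves markedly. Both theory and experiments show that the random forest improvement based on comprehensive sampling without replacement raises accuracy on balanced samples, making this data mining method better suited to big-data analysis and processing in service computing.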

Keywords: random forest, balanced data, unbalanced data, non-repeated sampling

Abstract:

Data mining is an important method in big data and service computing. As a typical data mining method, random forest is widely used because of its low error rate. To deal with big data more accurately and efficiently, we further improve the accuracy and efficiency of random forest. We demonstrate, both theoretically and experimentally, that our method decreases the generalization error by about 12%~20% when the number of samples drawn exceeds the size of the sample set. Moreover, we replace repeated sampling with a simpler method, which we show to be equivalent to repeated sampling. In this way, we reduce the time needed to build the forest, improving efficiency by about 10%~40% when this change is used alone. This gain offsets the efficiency loss introduced by the first improvement. Combining the two methods, we improve efficiency on unbalanced data by 10% and improve accuracy on balanced data by over 12% without any loss of efficiency. The proposed method is therefore better suited to big data analysis and processing in service computing than the original method.

Key words: random forest; balanced data; unbalanced data; sampling without replacement
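
The abstract contrasts repeated (bootstrap) sampling with sampling without replacement but does not spell out either procedure. The following is a minimal sketch of the generic difference between the two draws, not the authors' "comprehensive" scheme, whose details are not given here. The function names, the training-set size `n = 1000`, and the subsample size `632` are hypothetical choices made only for illustration.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (the classic bagging draw)."""
    return rng.choices(range(n), k=n)


def subsample_indices(n, k, rng):
    """Draw k distinct row indices without replacement (requires k <= n)."""
    return rng.sample(range(n), k)


rng = random.Random(0)   # fixed seed so repeated runs give the same draws
n = 1000                 # hypothetical training-set size

boot = bootstrap_indices(n, rng)
sub = subsample_indices(n, 632, rng)  # 632 is an illustrative size only

# A bootstrap draw repeats some rows, so only part of the set is distinct
# (in expectation about 1 - 1/e, roughly 63%, for large n).
unique_fraction = len(set(boot)) / n

print(f"bootstrap: {len(boot)} draws, {unique_fraction:.1%} distinct")
print(f"subsample: {len(sub)} draws, {len(set(sub)) / len(sub):.1%} distinct")
```

Because a draw without replacement never repeats a row, each tree sees only distinct samples and there are no duplicates to process, which is one plausible source of the reduced forest-building time that the abstract reports.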