• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (07): 1233-1238.

• Articles •

An improvement of random forests algorithm based on
comprehensive sampling without replacement  

LI Hui, LI Zheng, SHE Kun

  1. (School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China)
  • Received: 2014-08-05 Revised: 2015-04-03 Online: 2015-07-25 Published: 2015-07-25

Abstract:

Data mining is an important method in big data and service computing. As a typical data mining method, random forest is widely used because of its low error rate. To process big data more accurately and efficiently, we further improve both the accuracy and the efficiency of random forest. We show, both theoretically and experimentally, that our method reduces the generalization error by about 12% to 20% when the number we choose for the sampling exceeds the number of samples. Moreover, we replace repeated sampling with a simpler method, which we prove to be equivalent to repeated sampling. This shortens the time needed to build the forest and improves efficiency by about 10% to 40% when used alone. This gain can offset the efficiency loss introduced by the first improvement. Combining the two methods, we improve efficiency on unbalanced data by 10%, and improve accuracy on balanced data by over 12% without any impact on efficiency. The proposed method is therefore better suited than the original to big data analysis and processing in service computing.

Key words: random forest; balanced data; unbalanced data; sampling without replacement