• Journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


An optimized bagging decision tree
algorithm based on MapReduce

ZHANG Yuan-ming, CHEN Miao, LU Jia-wei, XU Jun, XIAO Gang

  1. (College of Computer Science & Technology, Zhejiang University of Technology, Hangzhou 310023, China)
  • Received: 2017-01-20  Revised: 2017-03-25  Online: 2017-05-25  Published: 2017-05-25

Abstract:

To address the overfitting and poor scalability of the C4.5 decision tree algorithm, we propose an optimized C4.5 algorithm that uses the Bagging technique, and then parallelize it under the MapReduce model. The optimized algorithm draws multiple new training sets, each the same size as the initial training set, by sampling with replacement. Training on each new training set yields one classifier, and a final classifier is formed by integrating their results under a majority voting rule. The optimized algorithm is then parallelized in three respects: processing the training sets in parallel, selecting the optimal splitting attribute and optimal split point in parallel, and generating child nodes in parallel. The parallel algorithm is implemented as a job workflow to improve its capacity for big data analysis. Experimental results show that the parallel, optimized decision tree algorithm achieves higher accuracy, higher sensitivity, better scalability, and higher performance.

Key words: decision tree, Bagging, MapReduce model, big data analysis, accuracy
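
The Bagging step described in the abstract (sampling with replacement, training one classifier per sample, majority voting) can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the base learner here is a hypothetical one-level decision stump standing in for C4.5, nothing is distributed over MapReduce, and all function names and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def bootstrap_sample(data, rng):
    """Draw len(data) rows from data by sampling with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def train_stump(rows):
    """Assumed stand-in for C4.5: a one-level tree over numeric features.

    rows is a list of (features, label) pairs. The split with the fewest
    training errors is chosen (not C4.5's gain-ratio criterion).
    """
    best = None  # (errors, feature index, threshold, left label, right label)
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best

    def predict(x):
        return left_label if x[f] <= t else right_label

    return predict


def bagging_fit(data, n_models, seed=0):
    """Train one classifier per bootstrap sample of the training set."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    return majority([m(x) for m in models])


# Toy, made-up training set: one numeric feature, two classes.
data = [((float(v),), "neg") for v in range(5)] + \
       [((float(v),), "pos") for v in range(10, 15)]
models = bagging_fit(data, n_models=11)
print(bagging_predict(models, (-1.0,)), bagging_predict(models, (100.0,)))
```

An odd number of classifiers avoids tied votes in the two-class case. In the parallel version the abstract describes, each bootstrap sample would presumably be handled by a separate task, but that mapping is not shown here.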