• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学

• 论文 • 上一篇    下一篇

基于MapReduce的Bagging决策树优化算法

张元鸣,陈苗,陆佳炜,徐俊,肖刚   

  1. (浙江工业大学计算机科学与技术学院,浙江 杭州 310023)
  • 收稿日期:2017-01-20 修回日期:2017-03-25 出版日期:2017-05-25 发布日期:2017-05-25
  • 基金资助:

    浙江省重大科技专项(2014C01408);浙江省公益性技术项目(2017C31014)

An optimized bagging decision tree
algorithm based on MapReduce

ZHANG Yuan-ming,CHEN Miao,LU Jia-wei,XU-Jun,XIAO Gang   

  1. (College of Computer Science & Technology,Zhejiang University of Technology,Hangzhou 310023,China)
  • Received:2017-01-20 Revised:2017-03-25 Online:2017-05-25 Published:2017-05-25

摘要:

针对经典C4.5决策树算法存在过度拟合和伸缩性差的问题,提出了一种基于Bagging的决策树改进算法,并基于MapReduce模型对改进算法进行了并行化。首先,基于Bagging技术对C4.5算法进行了改进,通过有放回采样得到多个与初始训练集大小相等的新训练集,并在每个训练集上进行训练,得到多个分类器,再根据多数投票规则集成训练结果得到最终的分类器;然后,基于MapReduce模型对改进算法进行了并行化,能够并行化处理训练集、并行选择最佳分割属性和最佳分割点,以及并行生成子节点,实现了基于MapReduce Job工作流的并行决策树改进算法,提高了对大数据集的分析能力。实验结果表明,并行Bagging决策树改进算法具有较高的准确度与敏感度,以及较好的伸缩性和加速比。

关键词: 决策树, Bagging, MapReduce模型, 大数据分析, 准确性

Abstract:

In order to address the shortcomings of overfitting and poor scalability of the C4.5 decision tree algorithm, we propose an optimized C4.5 algorithm with Bagging technique, and then parallelize it according to the MapReduce model. The optimized algorithm can obtain multiple new training sets that are equal to the initial training set by sampling with replacement. Multiple classifiers can be obtained by training the algorithm with these new training sets. A final classifier is generated according to a majority voting rule that integrates the training results. Then, the optimized algorithm is parallelized in three aspects, including parallel processing training sets, parallel selecting optimal decomposition attributes and optimal decomposition point, and parallel generating child nodes. A parallel algorithm based on job workflow is implemented to improve the ability of big data analysis. Experimental results show that the parallel and optimized decision tree algorithm has higher accuracy, higher sensitivity, better scalability and higher performance.

Key words: decision tree, Bagging, MapReduce model, big data analysis, accuracy