Computer Engineering & Science
ZHANG Yuan-ming, CHEN Miao, LU Jia-wei, XU Jun, XIAO Gang
Abstract:
To address the overfitting and poor scalability of the C4.5 decision tree algorithm, we propose an optimized C4.5 algorithm based on the Bagging technique and parallelize it according to the MapReduce model. The optimized algorithm draws multiple new training sets, each the same size as the initial training set, by sampling with replacement. Training the algorithm on each new training set yields one classifier, and the final classifier is obtained by combining their results under a majority voting rule. The optimized algorithm is then parallelized in three respects: processing the training sets in parallel, selecting the optimal split attribute and split point in parallel, and generating child nodes in parallel. A parallel algorithm based on a job workflow is implemented to improve its capacity for big data analysis. Experimental results show that the parallelized, optimized decision tree algorithm achieves higher accuracy, higher sensitivity, better scalability, and higher performance.
Key words: decision tree, Bagging, MapReduce model, big data analysis, accuracy
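The Bagging step described in the abstract (bootstrap sampling with replacement, one classifier per sample, majority voting) can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the base learner here is a one-level tree that picks a categorical attribute by a C4.5-style gain ratio, standing in for a full C4.5 tree, and all function names (`gain_ratio`, `train_stump`, `bootstrap_sample`, `train_bagging`, `predict_bagging`) are hypothetical. The MapReduce parallelization is not shown.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def group_labels(rows, labels, attr):
    """Group class labels by the value of one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return groups


def gain_ratio(rows, labels, attr):
    """C4.5-style gain ratio for splitting on a categorical attribute index."""
    total = len(rows)
    groups = group_labels(rows, labels, attr)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0


def train_stump(rows, labels):
    """Assumed base learner: a one-level tree on the best gain-ratio attribute."""
    best = max(range(len(rows[0])), key=lambda a: gain_ratio(rows, labels, a))
    default = Counter(labels).most_common(1)[0][0]
    leaves = {value: Counter(g).most_common(1)[0][0]
              for value, g in group_labels(rows, labels, best).items()}
    return best, leaves, default


def predict_stump(stump, row):
    """Predict with one stump; unseen attribute values fall back to the default class."""
    attr, leaves, default = stump
    return leaves.get(row[attr], default)


def bootstrap_sample(rows, labels, rng):
    """Draw a sample of the same size as the input, with replacement."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def train_bagging(rows, labels, n_models=5, seed=0):
    """Train one base classifier per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(*bootstrap_sample(rows, labels, rng)) for _ in range(n_models)]


def predict_bagging(models, row):
    """Combine the base classifiers' predictions by majority vote."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]
```

In this sketch each bootstrap sample is independent of the others, which is what makes the per-training-set work a natural candidate for parallel processing; an odd `n_models` avoids tied votes on two-class problems.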
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2017/V39/I05/841