• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

Computer Engineering & Science

Previous Articles     Next Articles

A parallel FP-Growth mining algorithm
based on Spark framework

ZHANG Wen,LUO Ke   

  1. (School of Computer & Communication Engineering,
    Changsha University of Science & Technology,Changsha 410114,China)
  • Received:2015-12-11 Revised:2016-03-31 Online:2017-08-25 Published:2017-08-25

Abstract:

The Apriori and FP-Growth are classical algorithms in frequent pattern mining. Since the Apriori has more flaws, the FP-Growth is a more efficient algorithm in stand-alone computing environment. Aiming at the bottlenecks of non-parallel computing in the era of big data, we propose a balanced parallel frequent pattern (BPFT) growth algorithm based on the connect-weight (CW) matrix of items in each transaction, called CWBPFP, which achieves parallel computing based on Spark framework. We use the load balance strategy to group data, and the corresponding code of each frequent item is stored in the relevant group during grouping. The connect information of items in each transaction of each grouped data is stored into a lower triangular connect-weight matrix by each working node. We use the restricted sub-tree to accelerate the speed of producing conditional FP-tree, and employ the connect-weight matrix to avoid the first scanning for the conditional patterns during mining frequent patterns of grouped data. The performance of the parallel mining FP-tree is improved due to the combination of the CW matrix and the restricted sub-tree applied to FP-tree mining process of each node. Experiments show that the CWBPFP has high performance and scalability on big data sets.

Key words: data mining, association rule, FP-Growth, big data, parallel computing, Spark