A parallel FP-Growth mining algorithm based on Spark framework

Abstract:

Apriori and FP-Growth are classical algorithms for frequent pattern mining. Because Apriori has more drawbacks, FP-Growth is the more efficient of the two in a stand-alone computing environment. To address the bottleneck of non-parallel computing in the era of big data, we propose a balanced parallel frequent pattern (BPFP) growth algorithm based on the connect-weight (CW) matrix of the items in each transaction, called CWBPFP, which runs in parallel on the Spark framework. We use a load-balancing strategy to group the data, and the code of each frequent item is stored in its corresponding group during grouping. Each worker node stores the connection information of the items in each transaction of its grouped data in a lower triangular connect-weight matrix. We use the restricted sub-tree to speed up the construction of conditional FP-trees, and employ the connect-weight matrix to avoid the first scan of the conditional patterns when mining the frequent patterns of the grouped data. Combining the CW matrix with the restricted sub-tree in the FP-tree mining process of each node improves the performance of parallel FP-tree mining. Experiments show that CWBPFP achieves high performance and scalability on big data sets.
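
The abstract does not define the connect-weight matrix precisely, so the following is only a minimal sketch under stated assumptions: frequent items are coded by descending support, and entry `cw[i][j]` (with `j < i`) counts the transactions that contain both coded items. The function names `rank_frequent_items`, `build_cw_matrix`, and `pair_support` are hypothetical and are not taken from the paper. Under this reading, the support of an item inside another item's conditional pattern base can be looked up in the matrix, which is one plausible way the first scan of the conditional patterns could be skipped.

```python
from collections import Counter
from itertools import combinations


def rank_frequent_items(transactions, min_support):
    """Assign integer codes to frequent items; code 0 is the most frequent item.

    Assumption: items are coded by descending support, ties broken by name.
    """
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [item for item, c in counts.items() if c >= min_support]
    frequent.sort(key=lambda item: (-counts[item], item))
    return {item: code for code, item in enumerate(frequent)}


def build_cw_matrix(transactions, codes):
    """Build a lower-triangular co-occurrence matrix.

    Row i stores columns 0..i-1 only, so cw[i][j] is defined for j < i and
    holds the number of transactions containing both coded items i and j.
    """
    n = len(codes)
    cw = [[0] * i for i in range(n)]
    for t in transactions:
        # Keep only frequent items, deduplicated and sorted by code.
        present = sorted({codes[item] for item in t if item in codes})
        for j, i in combinations(present, 2):  # combinations yields j < i
            cw[i][j] += 1
    return cw


def pair_support(cw, a, b):
    """Look up the co-occurrence count of two distinct item codes."""
    if a == b:
        raise ValueError("pair_support expects two distinct item codes")
    hi, lo = max(a, b), min(a, b)
    return cw[hi][lo]
```

In a Spark setting each worker would build such a matrix for its own group of transactions; the sketch above is single-process and leaves the grouping and the restricted sub-tree out entirely.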

Key words: data mining, association rule, FP-Growth, big data, parallel computing, Spark

ZHANG Wen, LUO Ke. A parallel FP-Growth mining algorithm based on Spark framework [J]. Computer Engineering & Science.

[1]	CHEN Qiaoan1,LI Feng1,CAO Yue1,LONG Mingsheng1,2. Parameter optimization for Spark jobs based on runtime data analysis [J]. J4, 20160101, 38(01): 11-19.
[2]	DAI Chang-wei, KONG Rui-lin, JI Zhe, . A parallel fast neighbor searching algorithm for particle-based methods on CPU and GPU architectures in multi-scale simulation [J]. Computer Engineering & Science, 2024, 46(08): 1349-1360.
[3]	ZHAO Yan, MA Hui-fang, WANG Wen-tao, TONG Hai-bin, HE Xiang-chun. A reliable response representation enhanced knowledge tracing method [J]. Computer Engineering & Science, 2024, 46(03): 535-544.
[4]	WU Chao, WEI Qian, ZHOU Jun-wei, LI Hui-min, SUN Guang-zhong. A parallel ambient noise data preprocessing algorithm based on heterogenous computing platform [J]. Computer Engineering & Science, 2023, 45(10): 1711-1719.
[5]	WANG Xin, PENG Jian. Implementation and optimization of HYB-based SpMV on the new-generation Sunway architecture [J]. Computer Engineering & Science, 2023, 45(10): 1754-1762.
[6]	WANG Xing-su, XIONG Wen, ZHANG Rui. A massive subway passenger trajectory similarity connection method:A case study of Shenzhen metro [J]. Computer Engineering & Science, 2023, 45(08): 1383-1392.
[7]	LIU Yi-cheng, LIU Xiao-yan, YAN Xin. A parallel balanced cascade support vector machine [J]. Computer Engineering & Science, 2023, 45(07): 1170-1177.
[8]	LEI Xuan, CHENG Guang, ZHANG Yu-jian, GUO Liang, ZHANG Fu-cun. Association analysis of alarm information based on power network situation awareness platform [J]. Computer Engineering & Science, 2023, 45(07): 1197-1208.
[9]	YANG Hao-yi, CHEN Wei, YAO Ze-huan, TAN Yu-song, LI Fei. Antifungal drug discovery base on transcriptome data of cell response [J]. Computer Engineering & Science, 2023, 45(02): 246-251.
[10]	WANG Chen-yu, WEN Hao-min, GUO Sheng-nan, LIN You-fang, WAN Huai-yu, . Multi-task deep spatial-temporal networkfor couriers pick-up arrival time prediction [J]. Computer Engineering & Science, 2023, 45(01): 136-144.
[11]	ZANG Zhao-hu, LI Chen, WANG Yao-hua, CHEN Xiao-wen, GUO Yang . A hierarchical hardware barrier synchronization design for many-core processors [J]. Computer Engineering & Science, 2022, 44(11): 1901-1908.
[12]	ZHANG Yong , ZHANG Xi , WAN Yun-bo , HE Xian-yao , ZHAO Zhong , LU Yu-tong. Optimizations of mesh renumbering for unstructured finite-volume computational fluid dynamics [J]. Computer Engineering & Science, 2022, 44(10): 1721-1729.
[13]	HU Yan-fang, XIONG Wen, GAO Wei. An online game user churn prediction method based on Spark platform [J]. Computer Engineering & Science, 2022, 44(10): 1730-1737.
[14]	CHENG Xiao-gang, GUO Ren, ZHOU Chang-li, . A distributed privacy-preserving data mining framework based on rational cryptography [J]. Computer Engineering & Science, 2022, 44(10): 1781-1787.
[15]	WANG Wen-tao, MA Hui-fang, SHU Yue-yu, HE Xiang-chun. Knowledge tracing based on contextualized representation [J]. Computer Engineering & Science, 2022, 44(09): 1693-1701.
