A parallel FP-Growth mining algorithm based on Spark framework

Computer Engineering & Science

ZHANG Wen, LUO Ke

(School of Computer & Communication Engineering, Changsha University of Science & Technology, Changsha 410114, China)

Received: 2015-12-11 Revised: 2016-03-31 Online: 2017-08-25 Published: 2017-08-25
Funding: National Natural Science Foundation of China (71371065, 11671125); Hunan Provincial Science and Technology Plan Project (2013SK3146)

Abstract

Apriori and FP-Growth are classical algorithms for frequent pattern mining. Because Apriori has more drawbacks, FP-Growth is the more efficient of the two in a single-machine computing environment. To address the bottleneck that non-parallel computing meets in the era of big data, we propose CWBPFP, a load-balanced parallel frequent pattern growth algorithm based on a connect-weight (CW) matrix of the items in each transaction. The algorithm runs in parallel on the Spark framework. Data are grouped with a load-balancing strategy, and what is stored in each group is the code of the corresponding frequent item. Each worker node stores the connection information of the items in every transaction of its grouped data in a lower triangular connect-weight matrix. A constrained sub-tree speeds up the construction of conditional FP-trees when each worker node mines frequent patterns, and the connect-weight matrix avoids the first scan of the conditional pattern base each time frequent patterns are mined in a group. Because the CW matrix and the constrained sub-tree are combined in the FP-tree mining process of every worker node, the performance of parallel FP-tree mining improves. Experiments show that the proposed parallel algorithm achieves high performance and scalability on large data sets.

Key words: data mining, association rule, FP-Growth, big data, parallel computing, Spark
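
The abstract does not define the connect-weight matrix in detail, so the following is only a minimal Python sketch of one plausible reading, not the authors' implementation. It assumes that items are already encoded as integers 0..n-1 in descending frequency order, and that an entry of the lower triangular matrix counts the transactions in which two items occur together. The function names `build_cw_matrix` and `conditional_item_counts` are hypothetical.

```python
from itertools import combinations


def build_cw_matrix(transactions, num_items):
    """Count pairwise co-occurrences in a lower triangular matrix.

    Assumption: items are integer codes 0..num_items-1.
    Row i stores only columns 0..i-1, so cw[i][j] (with j < i) is the
    number of transactions that contain both item i and item j.
    """
    cw = [[0] * i for i in range(num_items)]
    for transaction in transactions:
        codes = sorted(set(transaction))  # drop duplicates, fix the order
        for low, high in combinations(codes, 2):
            cw[high][low] += 1  # low < high, so this stays below the diagonal
    return cw


def conditional_item_counts(cw, item):
    """Support of every item that co-occurs with `item`.

    Under the assumed ordering, the prefix paths of `item` contain only
    items with smaller codes, so row `item` already holds the counts that
    a first scan of its conditional pattern base would compute.
    """
    return {other: count for other, count in enumerate(cw[item]) if count > 0}


# Four coded transactions over items 0..3 (illustrative data only).
transactions = [[0, 1, 2], [0, 2], [1, 2, 3], [0, 1, 3]]
cw = build_cw_matrix(transactions, num_items=4)
counts_for_3 = conditional_item_counts(cw, 3)

# Keep only items that are still frequent next to item 3.
min_support = 2
frequent_with_3 = [i for i, c in counts_for_3.items() if c >= min_support]
print(cw)               # [[], [2], [2, 2], [1, 2, 1]]
print(frequent_with_3)  # [1]
```

In this reading, the items that remain frequent in the conditional pattern base of item 3 are read directly from row 3 of the matrix, which is the kind of scan the abstract says the matrix avoids. Storing only the lower triangle roughly halves the memory needed for the symmetric co-occurrence counts.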

ZHANG Wen, LUO Ke. A parallel FP-Growth mining algorithm based on Spark framework[J]. Computer Engineering & Science.
