A parallel high utility itemset mining algorithm based on Spark

Computer Engineering & Science

HE Deng-ping1,2,3, HE Zong-hao1,2, LI Pei-qiang1,2

(1. School of Telecommunication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065;

2. Research Center of New Telecommunication Technology Applications, Chongqing University of Posts and Telecommunications, Chongqing 400065;

3. Chongqing Information Technology Designing Company Limited, Chongqing 400021, China)

Received: 2019-03-19 Revised: 2019-04-25 Online: 2019-10-25 Published: 2019-10-25

Abstract:

Traditional Top-K high utility itemset mining algorithms based on linked-list structures cannot meet mining requirements in big data environments. To address this problem, a parallel high utility itemset mining algorithm based on Spark (STKO) is proposed. First, the TKO algorithm is improved in terms of threshold raising and search space reduction. Then, on the Spark platform, the original data storage structure is changed and broadcast variables are used to optimize the iterative process, which avoids a large amount of recomputation while applying a load balancing strategy to mine Top-K high utility itemsets in parallel. Experimental results show that the proposed parallel algorithm can effectively mine high utility itemsets from large datasets.

Key words: data mining, high utility itemset, Spark big data framework, parallelization, Top-K
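
To make the terms in the abstract concrete, the following is a minimal single-machine sketch of Top-K high utility itemset mining with threshold raising. It is not the STKO or TKO algorithm: it enumerates itemsets by brute force, and the toy database `TRANSACTIONS`, the profit table `UNIT_PROFIT`, and the function names are hypothetical, introduced only for illustration. The utility of an itemset is taken here as the sum of quantity times unit profit over all transactions that contain every item of the itemset.

```python
import heapq
from itertools import combinations

# Hypothetical toy database: each transaction maps item -> purchased quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "c": 2, "d": 1},
]
# Hypothetical unit profit of each item.
UNIT_PROFIT = {"a": 5, "b": 2, "c": 1, "d": 4}


def itemset_utility(itemset, transactions, unit_profit):
    """Sum the utility of `itemset` over transactions containing all of its items."""
    total = 0
    for tx in transactions:
        if all(item in tx for item in itemset):
            total += sum(tx[item] * unit_profit[item] for item in itemset)
    return total


def top_k_high_utility(transactions, unit_profit, k):
    """Brute-force Top-K search that raises the utility threshold as results accumulate."""
    items = sorted({item for tx in transactions for item in tx})
    heap = []      # min-heap of (utility, itemset), holding at most k entries
    min_util = 0   # internal threshold, raised once k candidates are known
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, unit_profit)
            if utility <= min_util:
                continue  # cannot enter the current Top-K
            heapq.heappush(heap, (utility, itemset))
            if len(heap) > k:
                heapq.heappop(heap)   # drop the weakest candidate
            if len(heap) == k:
                min_util = heap[0][0]  # raise the threshold to the k-th best utility
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    for utility, itemset in top_k_high_utility(TRANSACTIONS, UNIT_PROFIT, 3):
        print(itemset, utility)
```

In a Spark setting, a small read-only table such as `UNIT_PROFIT` is the kind of data that could plausibly be shared with workers through a broadcast variable; how STKO actually structures and distributes its data is not specified on this page.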

HE Deng-ping, HE Zong-hao, LI Pei-qiang. A parallel high utility itemset mining algorithm based on Spark[J]. Computer Engineering & Science.
