• Official journal of the China Computer Federation
  • Core journal of Chinese science and technology
  • Chinese core journal

Computer Engineering & Science

• High Performance Computing

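Parallel high utility itemset mining algorithm based on Spark

HE Deng-ping1,2,3, HE Zong-hao1,2, LI Pei-qiang1,2

  (1. School of Telecommunication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065; 2. Research Center of New Telecommunication Technology Applications, Chongqing University of Posts and Telecommunications, Chongqing 400065;
    3. Chongqing Information Technology Designing Company Limited, Chongqing 400021)

  • Received date: 2019-03-19 Revised date: 2019-04-25 Publication date: 2019-10-25 Release date: 2019-10-25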

A parallel high utility itemset mining algorithm based on Spark
 

HE Deng-ping1,2,3, HE Zong-hao1,2, LI Pei-qiang1,2

  (1. School of Telecommunication and Information Engineering,
    Chongqing University of Posts and Telecommunications, Chongqing 400065;
    2. Research Center of New Telecommunication Technology Applications,
    Chongqing University of Posts and Telecommunications, Chongqing 400065;
    3. Chongqing Information Technology Designing Company Limited, Chongqing 400021, China)
  • Received:2019-03-19 Revised:2019-04-25 Online:2019-10-25 Published:2019-10-25

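Abstract:

To address the problem that traditional linked-list-based Top-K high utility mining algorithms cannot meet mining requirements in big data environments, a Spark-based parallel high utility itemset mining algorithm (STKO) is proposed. First, the TKO algorithm is improved in terms of threshold raising and search space reduction. Then, on the Spark platform, the original data storage structure is changed and broadcast variables are used to optimize the iterative process, so that Top-K high utility itemsets are mined in parallel using a load balancing approach while avoiding a large amount of recomputation. Experimental results show that the parallel algorithm can effectively mine high utility itemsets from large datasets.

Key words: data mining, high utility itemset, Spark big data framework, parallelization, Top-K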

Abstract:

Aiming at the problem that traditional Top-K high utility mining algorithms based on linked list structures cannot meet mining requirements in the big data environment, a parallel high utility itemset mining algorithm based on Spark (STKO) is proposed. Firstly, the TKO algorithm is improved in terms of threshold raising and search space reduction. Then, based on the Spark platform, the original data storage structure is changed and broadcast variables are used to optimize the iterative process, which avoids a large number of recalculations, and a load balancing strategy is used to realize parallel mining of Top-K high utility itemsets. The experimental results show that the proposed algorithm can effectively mine high utility itemsets from big datasets.
 

Key words: data mining, high utility itemset, Spark big data framework, parallelization, Top-K
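
The abstract names two ideas, threshold raising and search space reduction, without showing what they look like. The following is a minimal single-machine sketch of Top-K high utility itemset mining under stated assumptions. It is not the STKO algorithm, and it does not use Spark or the paper's data structures. The function name `top_k_high_utility_itemsets`, the toy transactions, and the unit profits are all hypothetical. The sketch assumes positive quantities and profits, keeps the current best k itemsets in a min-heap so the internal utility threshold rises as better itemsets are found, and prunes a branch when its transaction-weighted utility (a standard upper bound) falls below that threshold.

```python
import heapq


def top_k_high_utility_itemsets(transactions, profit, k):
    """Return the k itemsets with the highest utility as (utility, itemset) pairs.

    transactions: list of dicts mapping item -> purchased quantity
    profit: dict mapping item -> unit profit (assumed positive)
    """
    # Transaction utility: total profit of each whole transaction.
    tu = [sum(q * profit[i] for i, q in t.items()) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    heap = []  # min-heap holding the current top-k (utility, itemset) pairs

    def threshold():
        # The threshold is raised only once k candidates have been collected.
        return heap[0][0] if len(heap) == k else 0

    def search(prefix, start, tids):
        for idx in range(start, len(items)):
            item = items[idx]
            new_tids = [t for t in tids if item in transactions[t]]
            if not new_tids:
                continue
            itemset = prefix + (item,)
            # Transaction-weighted utility bounds the utility of every superset,
            # so a branch below the threshold can be skipped entirely.
            twu = sum(tu[t] for t in new_tids)
            if twu < threshold():
                continue
            utility = sum(
                transactions[t][i] * profit[i] for t in new_tids for i in itemset
            )
            if len(heap) < k:
                heapq.heappush(heap, (utility, itemset))
            elif utility > heap[0][0]:
                heapq.heapreplace(heap, (utility, itemset))  # raises the threshold
            search(itemset, idx + 1, new_tids)

    search((), 0, list(range(len(transactions))))
    return sorted(heap, reverse=True)


# Hypothetical toy data: three transactions and three items.
transactions = [{"a": 1, "b": 2}, {"a": 2, "c": 1}, {"b": 1, "c": 3}]
profit = {"a": 5, "b": 2, "c": 1}
print(top_k_high_utility_itemsets(transactions, profit, 2))
```

In this toy run the branch for itemset `(b, c)` is never evaluated, because its transaction-weighted utility (5) is already below the threshold (11) reached after two better itemsets were found. How such a search would be partitioned across Spark workers is not shown here.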