• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


A parallel high utility itemset mining
algorithm based on Spark
 

HE Deng-ping1,2,3, HE Zong-hao1,2, LI Pei-qiang1,2

  (1. School of Telecommunication and Information Engineering,
    Chongqing University of Posts and Telecommunications, Chongqing 400065;
    2. Research Center of New Telecommunication Technology Applications,
    Chongqing University of Posts and Telecommunications, Chongqing 400065;
    3. Chongqing Information Technology Designing Company Limited, Chongqing 400021, China)
  • Received: 2019-03-19  Revised: 2019-04-25  Online: 2019-10-25  Published: 2019-10-25

Abstract:

Traditional Top-K high utility itemset mining algorithms based on a linked-list structure cannot meet mining requirements in big data environments. To address this, a parallel high utility itemset mining algorithm based on Spark (STKO) is proposed. First, the TKO algorithm is improved by raising the threshold more quickly and reducing the search space. Then, on the Spark platform, the original data storage structure is changed and broadcast variables are used to optimize the iterative process, which avoids a large number of recalculations, and a load-balancing strategy is used to mine Top-K high utility itemsets in parallel. Experimental results show that the proposed algorithm can effectively mine high utility itemsets in large data sets.
 

Key words: data mining, high utility itemset, Spark big data framework, parallelization, Top-K
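
To make the terms in the abstract concrete, the following is a minimal single-machine sketch of Top-K high utility itemset mining with threshold raising. It is not the STKO or TKO algorithm itself: it uses a plain depth-first search with transaction-weighted utility (TWU) pruning rather than utility lists, and it does not use Spark. The function name `top_k_high_utility`, the dictionary-based transaction format, and the sample data are all assumptions made for illustration.

```python
import heapq


def top_k_high_utility(transactions, profit, k):
    """Return the k itemsets with the highest utility as (utility, itemset) pairs.

    transactions: list of dicts mapping item -> purchased quantity (assumed format)
    profit:       dict mapping item -> unit profit (assumed format)
    """
    # Transaction utility: total profit of each whole transaction.
    tu = [sum(q * profit[i] for i, q in t.items()) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    heap = []        # min-heap holding the current best k (utility, itemset)
    threshold = 0    # raised to the k-th best utility once the heap is full

    def search(prefix, tids, start):
        nonlocal threshold
        for idx in range(start, len(items)):
            item = items[idx]
            new_tids = [t for t in tids if item in transactions[t]]
            if not new_tids:
                continue
            # TWU is an upper bound on the utility of this itemset and of
            # every superset, so a low TWU lets us skip the whole branch.
            twu = sum(tu[t] for t in new_tids)
            if twu < threshold:
                continue
            itemset = prefix + (item,)
            utility = sum(
                transactions[t][i] * profit[i] for t in new_tids for i in itemset
            )
            if len(heap) < k:
                heapq.heappush(heap, (utility, itemset))
            elif utility > heap[0][0]:
                heapq.heapreplace(heap, (utility, itemset))
            if len(heap) == k:
                threshold = heap[0][0]  # threshold raising
            search(itemset, new_tids, idx + 1)

    search((), list(range(len(transactions))), 0)
    return sorted(heap, key=lambda e: (-e[0], e[1]))


# Hypothetical sample data: unit profits and four transactions.
profit = {"a": 5, "b": 2, "c": 1}
transactions = [
    {"a": 1, "b": 2},
    {"b": 1, "c": 3},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 1},
]
result = top_k_high_utility(transactions, profit, 3)
print(result)
```

The faster the threshold rises, the more branches the TWU check can discard, which is the general motivation behind the threshold-raising improvement the abstract mentions. In a Spark setting, read-only lookup data such as `profit` would be a natural candidate for a broadcast variable, though how STKO actually distributes its data is not specified here.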