• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• High Performance Computing

Research and application of a multidimensional association rules mining algorithm based on Hadoop

YANG Qing1,2,3, ZHANG Ya-wen1,2, ZHANG Qin1, YUAN Pei-ling1

  1. School of Computer, Central China Normal University, Wuhan, Hubei 430079, China
  2. Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Wuhan, Hubei 430079, China
  3. National Language Resources Monitor & Research Center for Network Media, Wuhan, Hubei 430079, China

  • Received: 2019-07-07  Revised: 2019-09-17  Online: 2019-12-25  Published: 2019-12-25
  • Supported by: National Natural Science Foundation of China (61532008); National Key Research and Development Program of China (2017YFC0909502)

Abstract:

The traditional Apriori algorithm must scan the data set multiple times, so as data volumes grow rapidly it is no longer well suited to big data analysis. To address this problem, an improved parallel Apriori algorithm, IPApriori, is designed. First, a pruning strategy is used to design IApriori, an Apriori variant suited to multidimensional data. Second, IApriori is combined with the Hadoop distributed framework to parallelize multidimensional association rule mining. IPApriori is then applied to association analysis for mobile phone user behavior prediction: it analyzes the main factors affecting user behavior and mines possible associations between that behavior and attributes in the age, gender, time, location, and mobile phone brand dimensions. Finally, experiments show that parallelizing the algorithm and the structure building method reduce the I/O load of the system and improve the execution efficiency of the algorithm.

Key words: Apriori algorithm, Hadoop, multidimensional association rules, parallelization
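
The abstract does not give the details of IApriori or IPApriori, so the following is only a minimal Python sketch of the general idea it describes: level-wise Apriori over multidimensional records, with candidate pruning and with support counting split into a map step per data partition and a reduce step that sums the partial counts. It is not the authors' algorithm and does not use Hadoop itself. All function names, the representation of an item as a `(dimension, value)` pair, the rule that an itemset holds at most one value per dimension, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def map_count(partition, candidates):
    """Map step (illustrative): count candidate itemsets in one partition."""
    counts = Counter()
    for record in partition:
        for cand in candidates:
            if cand <= record:  # candidate is contained in the record
                counts[cand] += 1
    return counts


def reduce_counts(partials):
    """Reduce step (illustrative): sum the per-partition counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets, then prune."""
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) != k:
                continue
            # Assumed multidimensional pruning: one value per dimension.
            if len({dim for dim, _ in union}) != k:
                continue
            # Classic Apriori pruning: every (k-1)-subset must be frequent.
            if all(frozenset(s) in frequent
                   for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def mine(partitions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    candidates = {frozenset([item])
                  for part in partitions for rec in part for item in rec}
    result, k = {}, 1
    while candidates:
        counts = reduce_counts(map_count(p, candidates) for p in partitions)
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
        candidates = generate_candidates(set(frequent), k)
    return result


# Hypothetical user records, already split into two partitions.
partitions = [
    [frozenset({("age", "20-29"), ("gender", "F"), ("brand", "A")}),
     frozenset({("age", "20-29"), ("gender", "F"), ("brand", "B")})],
    [frozenset({("age", "20-29"), ("gender", "M"), ("brand", "A")}),
     frozenset({("age", "30-39"), ("gender", "F"), ("brand", "A")})],
]
result = mine(partitions, min_support=2)
```

In this sketch each pass over the data counts only the candidates that survived pruning, which is where a pruning strategy can cut the work done per scan. In a Hadoop setting the map and reduce functions would run as distributed tasks over file splits instead of in-memory lists.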