• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• High Performance Computing

A survey of frequent itemset mining algorithms for sparse dataset

XIAO Wen, HU Juan

  1. (Wentian College, Hohai University, Maanshan, Anhui 243031, China)
  • Received: 2018-08-10 Revised: 2018-10-18 Online: 2019-05-25 Published: 2019-05-25
  • Supported by:

    National Natural Science Foundation of China (61202076); Beijing Natural Science Foundation (4192007)

Abstract:

Frequent itemset mining (FIM) is one of the most important data mining tasks, and the characteristics of the dataset being mined have a significant impact on the performance of FIM algorithms. In the era of big data, sparseness is a typical feature of the data and poses a severe challenge to the performance of traditional FIM algorithms. To address the problem of performing FIM efficiently on sparse data, we start from the characteristics of sparse data and analyze its main effects on the performance of three types of FIM algorithms. We then survey the FIM algorithms that have been proposed for sparse data, discuss the optimization strategies they adopt, and evaluate representative sparse-data FIM algorithms experimentally. The experimental results show that the pattern growth algorithm using a pseudo-construction strategy is the most suitable for FIM on sparse data, with a considerable advantage over the other algorithms in both running time and storage space.

Key words: big data, sparse data, frequent itemset mining (FIM), performance analysis, survey