A survey of frequent itemset mining algorithms for sparse dataset

Computer Engineering & Science

XIAO Wen, HU Juan

(Wentian College, Hohai University, Maanshan, Anhui 243031, China)

Received: 2018-08-10  Revised: 2018-10-18  Online: 2019-05-25  Published: 2019-05-25
Supported by:
National Natural Science Foundation of China (61202076); Beijing Natural Science Foundation (4192007)

Abstract:

Frequent itemset mining (FIM) is one of the most important data mining tasks, and the characteristics of the dataset being mined have a significant impact on the performance of FIM algorithms. In the era of big data, sparseness is a typical feature of the data and poses a severe challenge to the performance of traditional FIM algorithms. To address the problem of performing FIM efficiently on sparse data, we start from the characteristics of sparse data and analyze their main effects on the performance of three types of FIM algorithms, survey the FIM algorithms that have been proposed for sparse data, discuss the optimization strategies these algorithms adopt, and finally analyze the performance of representative sparse-data FIM algorithms through experiments. The experimental results show that the pattern growth algorithm with a pseudo-construction strategy is the most suitable for FIM on sparse data and has a considerable advantage over the other algorithms in both running time and storage space.

Key words: big data, sparse data, frequent itemset mining (FIM), performance analysis, survey
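
For readers new to the task, the two notions the abstract relies on, the support of an itemset and the sparseness of a dataset, can be illustrated with a minimal sketch. This is a brute-force toy and not one of the algorithms surveyed in the paper. The function names `density` and `frequent_itemsets`, the toy transactions, and the choice of density measure (average transaction length divided by the number of distinct items) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def density(transactions):
    """Average transaction length divided by the number of distinct items.

    Assumed measure for illustration: values near 0 suggest a sparse dataset.
    """
    distinct_items = set().union(*transactions)
    avg_len = sum(len(t) for t in transactions) / len(transactions)
    return avg_len / len(distinct_items)


def frequent_itemsets(transactions, min_support):
    """Brute-force FIM: count every non-empty subset of every transaction
    and keep those that occur in at least `min_support` transactions.

    Exponential in transaction length, so only suitable for toy data.
    """
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical toy dataset: 4 transactions over 5 distinct items.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"e"}]

print(density(transactions))               # average length 2.0 over 5 items
print(frequent_itemsets(transactions, 2))  # itemsets seen in >= 2 transactions
```

With a minimum support of 2, only `a`, `b`, and the pair `{a, b}` qualify in this toy dataset. In sparse data most item combinations rarely co-occur, which is why the way an algorithm generates and stores candidates matters so much for running time and memory.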

XIAO Wen, HU Juan. A survey of frequent itemset mining algorithms for sparse dataset[J]. Computer Engineering & Science.
