A top-k high utility itemset mining algorithm based on R-list

Computer Engineering & Science

HE Dengping1,2,3, HE Zonghao1,2

(1. School of Telecommunications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065;
2. Research Center of New Telecommunication Technology Applications, Chongqing University of Posts and Telecommunications, Chongqing 400065;
3. Chongqing Information Technology Designing Company Limited, Chongqing 401121, China)

Received: 2018-10-18  Revised: 2018-11-30  Online: 2019-07-25  Published: 2019-07-25

Abstract

Existing one-phase top-k high utility itemset mining algorithms raise the threshold slowly and generate large numbers of candidate itemsets during iteration, which consumes excessive memory. To address these problems, we propose RHUM, a top-k high utility itemset mining algorithm based on a reused list (R-list). The R-list is a new data structure that stores itemset information and allows fast access to it, so the database does not need to be scanned a second time for mining. RHUM reuses memory to hold candidate itemset information and preprocesses the data with an improved RSD threshold-raising strategy. During the recursive search, it applies stricter pruning parameters and computes the utilities of multiple itemsets simultaneously to narrow the search space. Experimental results on different types of datasets show that RHUM is more memory-efficient than other one-phase algorithms and remains stable as K varies.

Key words: high utility itemset, one-phase mining, reused list (R-list), data mining, top-k
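To make the problem setting concrete, the following is a minimal Python sketch of top-k high utility itemset mining by exhaustive enumeration. It is not RHUM: it uses no R-list, no RSD strategy, and no pruning. It only illustrates the standard utility definition (quantity times unit profit, summed over the transactions that contain the whole itemset) and the general idea of raising an internal threshold to the k-th best utility found so far. The toy data and the names `TRANSACTIONS`, `PROFIT`, `utility`, and `top_k_itemsets` are assumptions made for illustration.

```python
import heapq
from itertools import combinations

# Hypothetical toy database: each transaction maps item -> purchased quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2, "c": 1},
    {"a": 2, "c": 3},
    {"b": 1, "c": 2, "d": 1},
    {"a": 1, "b": 1, "d": 2},
]
# Hypothetical unit profit of each item.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 4}


def utility(itemset, transactions, profit):
    """Sum quantity * unit profit over every transaction containing the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total


def top_k_itemsets(transactions, profit, k):
    """Brute-force top-k search; returns (itemsets sorted by utility, final threshold).

    A min-heap of size k holds the best itemsets seen so far. Once it is full,
    its root is the k-th best utility and serves as the raised threshold.
    """
    items = sorted(profit)
    heap = []       # entries are (utility, itemset)
    threshold = 0   # stays 0 until k itemsets have been found
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = utility(itemset, transactions, profit)
            if u == 0:
                continue  # itemset never occurs in the database
            if len(heap) < k:
                heapq.heappush(heap, (u, itemset))
            elif u > heap[0][0]:
                heapq.heapreplace(heap, (u, itemset))
            if len(heap) == k:
                threshold = heap[0][0]
    return sorted(heap, reverse=True), threshold


if __name__ == "__main__":
    best, final_threshold = top_k_itemsets(TRANSACTIONS, PROFIT, k=2)
    print(best, final_threshold)
```

In this toy data, `("a", "b")` has a higher utility than `("b",)` alone, so utility is not monotone under adding items. That is why practical algorithms rely on upper bounds and dedicated pruning rather than support-style pruning, and why a threshold that rises quickly matters: the sooner it rises, the more candidates can be discarded.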

HE Dengping, HE Zonghao. A top-k high utility itemset mining algorithm based on R-list[J]. Computer Engineering & Science.
