• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2007, Vol. 29 ›› Issue (9): 110-113.

• Articles •

An Efficient Parallel Frequent Itemset Mining Method for Very Large Text Databases

WANG Yongheng (王永恒), YANG Shuqiang (杨树强), JIA Yan (贾焰)

  • Publication date: 2007-09-01   Posted online: 2010-06-02

Abstract:

Frequent itemset mining is a common and useful task in data mining, and it is also important in text mining. However, most current mining algorithms cannot be applied to very large text databases. To address the special requirements of frequent itemset mining in very large text databases, we propose a new parallel mining algorithm, parFIM. Built on a simple data structure, H-Struct, parFIM partitions the data vertically so that mining can proceed in parallel. The algorithm also takes into account the removal of short patterns and the reduction of duplicated patterns. Our experiments show that parFIM is well suited to frequent itemset mining in very large text databases.

Key words: text mining; very large text database; frequent itemset; parallel
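
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of mining frequent itemsets in parallel over a vertically partitioned search space. It is not parFIM and does not reproduce H-Struct. It assumes that "vertical" partitioning means splitting the work by item rather than by document, and it swaps in an Eclat-style intersection of document-id sets as the mining step. All names (`build_tidlists`, `mine_prefix`, `parallel_frequent_itemsets`, `min_len`, `n_workers`) are hypothetical. The `min_len` filter is a simple stand-in for removing short patterns.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_tidlists(docs, min_support):
    """Map each frequent term to the set of ids of documents containing it."""
    tidlists = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            tidlists[term].add(doc_id)
    return {t: ids for t, ids in tidlists.items() if len(ids) >= min_support}


def mine_prefix(prefix, prefix_ids, candidates, min_support, out):
    """Extend `prefix` depth-first with items that sort after its last item."""
    for i, (item, ids) in enumerate(candidates):
        new_ids = prefix_ids & ids
        if len(new_ids) >= min_support:
            itemset = prefix + (item,)
            out[itemset] = len(new_ids)
            mine_prefix(itemset, new_ids, candidates[i + 1:], min_support, out)


def parallel_frequent_itemsets(docs, min_support, min_len=1, n_workers=2):
    """Assumed scheme: each worker owns the itemsets led by its share of items."""
    ordered = sorted(build_tidlists(docs, min_support).items())

    def work(start):
        # Worker `start` handles items start, start + n_workers, ...
        out = {}
        for i in range(start, len(ordered), n_workers):
            item, ids = ordered[i]
            out[(item,)] = len(ids)
            mine_prefix((item,), ids, ordered[i + 1:], min_support, out)
        return out

    result = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for part in pool.map(work, range(n_workers)):
            result.update(part)
    # Stand-in for short-pattern removal: keep itemsets with >= min_len items.
    return {s: c for s, c in result.items() if len(s) >= min_len}


if __name__ == "__main__":
    docs = [
        ["data", "mining", "text"],
        ["data", "mining"],
        ["text", "mining"],
        ["data", "text", "mining"],
    ]
    for itemset, support in sorted(
        parallel_frequent_itemsets(docs, min_support=2, min_len=2).items()
    ):
        print(itemset, support)
```

Because every itemset is generated only under its smallest item, each one is produced by exactly one partition in this sketch, so the merged result needs no deduplication. Threads are used here only to keep the example short; a real deployment on a very large text database would more plausibly distribute the partitions across processes or machines.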