• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2007, Vol. 29 ›› Issue (7): 64-66.

• Papers •

A New Two-Phase Sampling Algorithm

马光志 张耀坤   

  • Published: 2007-07-01  Posted online: 2010-06-02

Abstract:

Traditional two-phase sampling algorithms extract sample data from massive data sets for use in data mining. Their efficiency is low when the data set is very large, and their sampling accuracy is low when the data set is both very large and sparse. This paper improves on the traditional two-phase sampling algorithm and proposes a new sampling algorithm, EAFAST, which adjusts its parameters adaptively and makes full use of historical information to perform heuristic search. Experiments show that EAFAST improves efficiency and sampling accuracy at the same time, remedying the shortcomings of the traditional algorithms.

Key words: sampling, two-phase, frequent itemset, pruning, accuracy
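
The abstract does not describe EAFAST's internals, so the following is only a minimal sketch of the general two-phase idea it builds on, not the authors' algorithm. It assumes a simple trimming scheme: phase one draws a random initial sample and estimates item supports from it, and phase two greedily removes transactions until a smaller final sample remains whose item supports stay close to those estimates. The function names, the L1 distance, and the greedy rule are all illustrative assumptions. The adaptive parameter tuning and history-guided heuristic search attributed to EAFAST are not modeled.

```python
import random
from collections import Counter


def item_supports(transactions):
    """Return the fraction of transactions that contain each item."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}


def distance(sample, reference):
    """L1 distance between the item supports of `sample` and `reference`."""
    supports = item_supports(sample)
    items = set(supports) | set(reference)
    return sum(abs(supports.get(i, 0.0) - reference.get(i, 0.0)) for i in items)


def two_phase_sample(data, initial_size, final_size, seed=0):
    """Hypothetical two-phase sampler (illustrative only, not EAFAST)."""
    if final_size > initial_size:
        raise ValueError("final_size must not exceed initial_size")
    rng = random.Random(seed)

    # Phase 1: simple random sample, used to estimate item supports.
    initial = rng.sample(data, initial_size)
    reference = item_supports(initial)

    # Phase 2: repeatedly drop the transaction whose removal leaves the
    # remaining sample closest to the phase-1 support estimates.
    sample = list(initial)
    while len(sample) > final_size:
        best = min(
            range(len(sample)),
            key=lambda k: distance(sample[:k] + sample[k + 1:], reference),
        )
        sample.pop(best)
    return sample


if __name__ == "__main__":
    data = [("a", "b"), ("a",), ("b", "c"), ("a", "c"), ("a", "b", "c"), ("c",)] * 5
    final = two_phase_sample(data, initial_size=12, final_size=4, seed=1)
    print(len(final), "transactions kept")
```

This greedy trimming recomputes supports for every candidate removal, so it is only practical for small inputs. That cost on large data sets is the kind of inefficiency the abstract attributes to traditional two-phase methods.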