• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2008, Vol. 30 ›› Issue (11): 151-154.

• Paper •

An Efficient Data Stream Mining Algorithm Based on Binary Search Trees

何昭青[1,2]   

  • Online:2008-11-01 Published:2010-05-19

Abstract:

Classification over data streams is a very challenging task in the field of data mining. VFDT is a one-pass algorithm for decision tree construction that uses the Hoeffding inequality to obtain a probabilistic bound on the accuracy of the constructed tree. VFDTc improves VFDT so that it can process continuous attributes. Based on VFDT and VFDTc, we design and implement VFDT-BSTree, an efficient algorithm based on binary search trees. The algorithm solves the problems existing in VFDTc and speeds up both dynamic sample insertion and best split node selection, which in turn improves classification speed. Experimental results show that the execution time of VFDT-BSTree is on average 32.25% less than that of VFDT and 24.96% less than that of VFDTc, while tree size and classification accuracy remain the same.

Key words: data streams, binary search tree, continuous attribute
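
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's VFDT-BSTree implementation, whose details are not given on this page. The function and class names (`hoeffding_bound`, `Node`, `insert`, `cumulative_counts`) are hypothetical. Storing per-node class counts and accumulating them with an in-order traversal is a simplifying assumption made here for illustration. The bound itself is the standard Hoeffding bound used by VFDT: with probability 1 - delta, the observed mean of n samples of a variable with range R lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional, Tuple


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


@dataclass
class Node:
    """One distinct value of a continuous attribute (illustrative layout)."""
    key: float
    counts: Dict[str, int] = field(default_factory=dict)  # class label -> count
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], value: float, label: str) -> Node:
    """Insert one (value, label) sample and return the tree root."""
    if root is None:
        return Node(value, {label: 1})
    node = root
    while True:
        if value == node.key:
            # Repeated value: only update the class count.
            node.counts[label] = node.counts.get(label, 0) + 1
            return root
        side = "left" if value < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(value, {label: 1}))
            return root
        node = child


def cumulative_counts(root: Optional[Node]) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (threshold, class counts of samples with value <= threshold)
    in ascending threshold order, using an iterative in-order traversal."""
    running: Dict[str, int] = {}
    stack, node = [], root
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = node.left
        node = stack.pop()
        for label, count in node.counts.items():
            running[label] = running.get(label, 0) + count
        yield node.key, dict(running)
        node = node.right


if __name__ == "__main__":
    root = None
    for value, label in [(5.0, "a"), (3.0, "b"), (7.0, "a"), (3.0, "a")]:
        root = insert(root, value, label)
    for threshold, counts in cumulative_counts(root):
        print(threshold, counts)
    print(hoeffding_bound(1.0, 1e-7, 1000))
```

In this sketch each insertion costs time proportional to the tree height, and a single traversal produces the class distribution on the low side of every candidate threshold. A split heuristic such as information gain could then be evaluated at each threshold, and the split accepted once the gap between the two best candidates exceeds the Hoeffding bound.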