• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2011, Vol. 33 ›› Issue (12): 130-135.

• Articles •

A Balanced Discretization Algorithm Based on the Minimum Description Length

HUANG Dong

  1. (School of Computer and Information Engineering, Yibin University, Yibin 644007, Sichuan, China)
  • Received: 2011-06-18 Revised: 2011-09-26 Online: 2011-12-24 Published: 2011-12-25

Abstract:

Discretization of continuous data is an important preprocessing step for classification methods in data mining. This paper presents a balanced discretization method based on the minimum description length (MDL) principle. Drawing on MDL theory, the method defines a balanced discretization function that measures the relationship between the discrete intervals and the classification errors. Building on this balanced function, an effective heuristic algorithm is proposed to search for the best sequence of breakpoints. Simulation results show that the proposed algorithm achieves good classification and learning performance with the C5.0 decision tree and the naive Bayes classifier.

Key words: discretization; data mining; minimum description length (MDL); balanced function
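
The abstract does not give the balanced function or the heuristic search, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes a simple two-part cost in bits (one term for the breakpoints, one for the examples misclassified by each interval's majority class) and a greedy forward search over candidate breakpoints. The names `candidate_cuts`, `errors`, `cost`, and `greedy_discretize`, and the cost formula itself, are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def candidate_cuts(values):
    """Midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def errors(values, labels, cuts):
    """Count examples that differ from the majority class of their interval."""
    bins = {}
    for v, y in zip(values, labels):
        idx = sum(v > c for c in cuts)  # index of the interval containing v
        bins.setdefault(idx, Counter())[y] += 1
    return sum(sum(cnt.values()) - max(cnt.values()) for cnt in bins.values())


def cost(values, labels, cuts, n_candidates):
    """Assumed two-part cost in bits: breakpoints plus misclassified examples."""
    model_bits = len(cuts) * math.log2(max(n_candidates, 2))
    error_bits = errors(values, labels, cuts) * math.log2(max(len(values), 2))
    return model_bits + error_bits


def greedy_discretize(values, labels):
    """Add the breakpoint that lowers the cost most until none helps."""
    cands = candidate_cuts(values)
    cuts = []
    best = cost(values, labels, cuts, len(cands))
    while True:
        trials = [(cost(values, labels, sorted(cuts + [c]), len(cands)), c)
                  for c in cands if c not in cuts]
        if not trials:
            break
        new_cost, c = min(trials)
        if new_cost >= best:  # no remaining breakpoint pays for itself
            break
        cuts, best = sorted(cuts + [c]), new_cost
    return cuts


print(greedy_discretize([1, 2, 3, 10, 11, 12], list("aaabbb")))  # [6.5]
```

In this toy cost, each extra breakpoint adds model bits and each misclassified example adds error bits, so a breakpoint is kept only when it removes enough errors to pay for itself. That is one simple way to balance the number of intervals against the classification errors; the paper's actual function may differ.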