• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining

A data sampling method based on double decision trees

CHEN Li1, FEI Hongxiao2, DING Hailun2, CHENG Lin2, ZHAI Jiyu2

  1. (1. School of Geosciences and InfoPhysics, Central South University, Changsha 410075, China;
    2. School of Software, Central South University, Changsha 410075, China)
  • Received: 2017-10-17 Revised: 2018-03-20 Online: 2019-01-25 Published: 2019-01-25
  • Supported by:

    National Natural Science Foundation of China (61602525); Central South University 2017 Undergraduate Free Exploration Project (201710533267, ZY20170769)

Abstract:

A basic assumption in data mining is that training samples and test samples follow the same data distribution. As data volumes grow, however, finding representative data in massive data sets becomes increasingly difficult. A study of existing data selection methods shows that traditional methods such as simple random sampling and progressive sampling are not integrated with the data mining tool, so their sampling results are subject to chance and uncertainty. The sampled data can hardly guarantee the basic assumption of data mining, which increases the generalization error of the final model. To address these problems, including class imbalance during sampling, we propose a structured data sampling method based on double decision trees. First, a decision tree generated with the C4.5 algorithm is used to select suitable data and data collection points from the data source. A second decision tree then evaluates the quality of the selected data set, so that sampling is both efficient and of high quality. Experiments show that, compared with simple random sampling, models trained on the newly sampled data achieve clearly higher accuracy.
 

Key words: decision tree, data sampling, machine learning
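
The abstract names C4.5 as the algorithm that builds the first decision tree. As a minimal sketch of the split criterion C4.5 is known for (the gain ratio), the following plain-Python example scores two categorical attributes on a toy data set. The function names `entropy` and `gain_ratio` and the toy data are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the authors' double-tree sampling procedure.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """C4.5 gain ratio of splitting `labels` by a categorical attribute `values`."""
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Information gain: entropy before the split minus weighted entropy after it.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data, not from the paper.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy = ["yes", "no", "yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes", "yes", "yes"]

# Pick the attribute with the highest gain ratio, as C4.5 does at each node.
best = max([("outlook", outlook), ("windy", windy)],
           key=lambda item: gain_ratio(item[1], play))
print(best[0])  # prints "outlook"
```

Here `outlook` separates the classes perfectly, while `windy` carries no information about `play`, so the gain ratio selects `outlook` as the split attribute. A full C4.5 implementation would also handle continuous attributes, missing values, and pruning, which this sketch leaves out.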