• Official journal of the China Computer Federation
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science

A data sampling method based on double decision tree

CHEN Li1,FEI Hongxiao2,DING Hailun2,CHENG Lin2,ZHAI Jiyu2   

  1. School of Geosciences and InfoPhysics, Central South University, Changsha 410075, China
  2. School of Software, Central South University, Changsha 410075, China
  • Received: 2017-10-17  Revised: 2018-03-20  Online: 2019-01-25  Published: 2019-01-25

Abstract:

A basic assumption in data mining is that the training samples and the test samples follow the same data distribution. As data volumes grow, however, finding representative data within a very large data set becomes particularly difficult. A study of existing data selection methods, such as simple random sampling and progressive sampling, shows that their sampling effect is hard to evaluate because they are not integrated with the data mining tool. Because of chance factors and uncertainty, the basic assumption is difficult to guarantee, which in turn increases the generalization error of the model. To address these problems, we propose a structured data sampling method based on a double decision tree. First, we generate a decision tree with the C4.5 algorithm and use it to select appropriate data and data collection points in the data source. Then, we generate a second decision tree to evaluate the quality of the selected data set, achieving efficient, high-quality data sampling. Experiments show that, compared with random sampling, models trained on data sampled by our method are markedly more accurate.
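
The abstract names C4.5 for the first tree, and C4.5 chooses its splits by gain ratio (information gain divided by split information). The sketch below computes that criterion for categorical attributes and then illustrates one possible selection step: drawing the same fraction of records from every branch of the best split. The abstract does not say how the tree selects data, so the branch-wise sampling is an assumption made for illustration, not the authors' procedure. The function names, the toy records, and the single-level split (rather than a full tree) are likewise hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """C4.5 gain ratio for splitting on the categorical attribute `attr`."""
    branches = defaultdict(list)
    for row, label in zip(rows, labels):
        branches[row[attr]].append(label)
    total = len(labels)
    # Expected entropy that remains after the split.
    remainder = sum(len(b) / total * entropy(b) for b in branches.values())
    # Split information penalises attributes with many small branches.
    split_info = -sum(
        len(b) / total * math.log2(len(b) / total) for b in branches.values()
    )
    if split_info == 0:  # the attribute takes a single value: no usable split
        return 0.0
    return (entropy(labels) - remainder) / split_info


def best_attribute(rows, labels, attrs):
    """Attribute with the highest gain ratio."""
    return max(attrs, key=lambda a: gain_ratio(rows, labels, a))


def sample_by_branch(rows, attr, fraction, seed=0):
    """Assumed selection step: draw the same fraction from every branch of `attr`."""
    rng = random.Random(seed)
    branches = defaultdict(list)
    for index, row in enumerate(rows):
        branches[row[attr]].append(index)
    chosen = []
    for indices in branches.values():
        k = max(1, round(fraction * len(indices)))  # keep at least one record per branch
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy records (hypothetical): "outlook" fully determines the label, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
]
labels = ["play", "play", "stay", "stay"]

split = best_attribute(rows, labels, ["outlook", "windy"])  # "outlook"
sample = sample_by_branch(rows, split, fraction=0.5)        # one record per branch
```

In this sketch every branch of the chosen split is represented in the sample, which plain random sampling does not guarantee. The evaluation step described in the abstract, a second tree trained to assess the selected data, is not shown here.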
 

Key words: decision tree, data sampling, machine learning