A data sampling method based on double decision tree

Computer Engineering & Science

CHEN Li1, FEI Hongxiao2, DING Hailun2, CHENG Lin2, ZHAI Jiyu2

(1. School of Geosciences and InfoPhysics, Central South University, Changsha 410075, China; 2. School of Software, Central South University, Changsha 410075, China)

Received: 2017-10-17  Revised: 2018-03-20  Online: 2019-01-25  Published: 2019-01-25

Funding: National Natural Science Foundation of China (61602525); Central South University 2017 Undergraduate Free Exploration Project (201710533267, ZY20170769)

Abstract

A basic assumption in data mining is that the training samples and the test samples follow the same data distribution. As data volumes grow, however, finding representative data in a massive data set becomes increasingly difficult. Our study of existing data selection methods finds that traditional methods such as simple random sampling and progressive sampling are not integrated with the data mining tool, so their sampling results are subject to chance and uncertainty. The sampled data can hardly guarantee the basic assumption of data mining, which enlarges the generalization error of the final model. To address these problems, including the imbalance between classes that arises during sampling, we propose a structured data sampling method based on double decision trees. First, a decision tree is generated with the C4.5 algorithm and used to select suitable data and data collection points from the data source. A second decision tree is then used to evaluate the quality of the selected data set, so that sampling is both efficient and of high quality. Experiments show that, compared with simple random sampling, models trained on data sampled by the new method achieve clearly higher accuracy.

Key words: decision tree, data sampling, machine learning
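The abstract does not spell out how the two trees interact, so the following Python sketch is only one possible reading of it. It assumes that the leaves of the first tree serve as sampling strata, and that the second tree is trained on the sample and scored on the rows left out. The helper names (`build_tree`, `walk`, `tree_guided_sample`, `accuracy`), the toy data, and the simplified gain-ratio tree (categorical attributes only, no pruning, so not a full C4.5) are illustrative assumptions rather than the authors' implementation.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits of a list of hashable values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio (the C4.5 split criterion) for categorical column `attr`."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    return (entropy(labels) - remainder) / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attrs, depth=3):
    """Build a nested-dict tree; a leaf is simply the majority class label."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or not attrs or len(set(labels)) == 1:
        return majority
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) <= 1e-12:  # no informative split left
        return majority
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[best]].append(i)
    rest = [a for a in attrs if a != best]
    children = {
        value: build_tree([rows[i] for i in idx], [labels[i] for i in idx], rest, depth - 1)
        for value, idx in groups.items()
    }
    return {"attr": best, "default": majority, "children": children}


def walk(tree, row):
    """Return (path to the leaf, predicted label) for one row."""
    path = []
    while isinstance(tree, dict):
        value = row[tree["attr"]]
        path.append((tree["attr"], value))
        if value not in tree["children"]:  # attribute value unseen in training
            return tuple(path), tree["default"]
        tree = tree["children"][value]
    return tuple(path), tree


def tree_guided_sample(rows, tree, fraction, rng):
    """Draw about `fraction` of the rows in each leaf, at least one per leaf."""
    strata = defaultdict(list)
    for i, row in enumerate(rows):
        strata[walk(tree, row)[0]].append(i)
    chosen = []
    for idx in strata.values():
        chosen.extend(rng.sample(idx, max(1, round(fraction * len(idx)))))
    return sorted(chosen)


def accuracy(tree, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(walk(tree, r)[1] == y for r, y in zip(rows, labels)) / len(labels)


# Hypothetical toy data: three categorical attributes, imbalanced labels.
combos = [(a, b, c) for a in range(3) for b in range(3) for c in range(3)]
rows = combos * 10
labels = ["pos" if a == 0 and b != 1 else "neg" for a, b, c in rows]

rng = random.Random(0)
selector = build_tree(rows, labels, attrs=[0, 1, 2])  # first tree: guides sampling
chosen = tree_guided_sample(rows, selector, fraction=0.2, rng=rng)
chosen_set = set(chosen)
held_out = [i for i in range(len(rows)) if i not in chosen_set]

# Second tree: trained only on the sample, scored on the rows left out.
evaluator = build_tree([rows[i] for i in chosen], [labels[i] for i in chosen], [0, 1, 2])
score = accuracy(evaluator, [rows[i] for i in held_out], [labels[i] for i in held_out])
print(f"sampled {len(chosen)} of {len(rows)} rows, held-out accuracy {score:.2f}")
```

Sampling per leaf, with at least one row from each, keeps small regions of the attribute space in the sample, which is one way to read the stated goal of reducing imbalance between classes. The held-out score of the second tree then acts as a rough quality signal for the sample; it could be compared with the score obtained from a simple random sample of the same size.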

CHEN Li, FEI Hongxiao, DING Hailun, CHENG Lin, ZHAI Jiyu. A data sampling method based on double decision tree[J]. Computer Engineering & Science.
