A data sampling method based on double decision tree

Computer Engineering & Science

CHEN Li1, FEI Hongxiao2, DING Hailun2, CHENG Lin2, ZHAI Jiyu2

(1. School of Geosciences and InfoPhysics, Central South University, Changsha 410075;

2. School of Software, Central South University, Changsha 410075, China)

Received: 2017-10-17 Revised: 2018-03-20 Online: 2019-01-25 Published: 2019-01-25

Abstract

In data mining, a basic assumption is that the training set and the test set follow the same data distribution. As data volumes grow, however, finding representative data within a very large data source becomes particularly difficult. A study of existing data selection methods, such as simple random sampling and progressive sampling, shows that their sampling effect is hard to evaluate because they are not integrated with the data mining tool. Because of chance factors and uncertainty, such methods can hardly guarantee this basic assumption, which in turn increases the generalization error of the model. To address these problems, we propose a structured data sampling method based on a double decision tree. First, we generate a decision tree with the C4.5 algorithm and use it to select appropriate data and data collection points in the data source. Then, we generate a second decision tree to evaluate the quality of the selected data set, which yields efficient, high-quality data sampling. Experiments show that, compared with random sampling, models trained on data sampled by our method achieve clearly higher accuracy.

Key words: decision tree, data sampling, machine learning
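
The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not give algorithmic details, so everything below is an assumption made for illustration only: the first tree is grown on a small pilot sample and its leaves are used as strata from which records are drawn proportionally, and the second tree is trained on the selected records and scored on held-out data as a proxy for sample quality. The splitting rule is plain information gain on binary numeric thresholds, a simplification of C4.5 (no gain ratio, no pruning). All function names (`build_tree`, `tree_guided_sample`, `evaluate_sample`) and the synthetic data are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature, threshold) of the best binary split, or None."""
    base, best = entropy(labels), None
    for f in range(len(rows[0])):
        # Every distinct value except the largest is a candidate threshold.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def build_tree(rows, labels, depth=3):
    """Grow a small tree by information gain (simplified stand-in for C4.5)."""
    split = best_split(rows, labels) if depth > 0 and len(set(labels)) > 1 else None
    if split is None or split[0] <= 0:
        return {"label": Counter(labels).most_common(1)[0][0]}  # leaf node
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {"feature": f, "threshold": t,
            "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth - 1),
            "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth - 1)}

def leaf_of(tree, row):
    """Follow the splits and return the leaf node that the row falls into."""
    while "label" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

def tree_guided_sample(rows, labels, n_pilot, n_sample, rng):
    """Stage 1 (assumed): use the leaves of a pilot tree as sampling strata."""
    idx = list(range(len(rows)))
    pilot = rng.sample(idx, n_pilot)
    selector = build_tree([rows[i] for i in pilot], [labels[i] for i in pilot])
    groups = defaultdict(list)
    for i in idx:
        groups[id(leaf_of(selector, rows[i]))].append(i)
    chosen = []
    for members in groups.values():
        # Draw from each leaf in proportion to its size, at least one record.
        k = max(1, round(n_sample * len(members) / len(idx)))
        chosen += rng.sample(members, min(k, len(members)))
    return chosen

def evaluate_sample(rows, labels, chosen, holdout):
    """Stage 2 (assumed): train a second tree on the sample, score it on held-out data."""
    evaluator = build_tree([rows[i] for i in chosen], [labels[i] for i in chosen])
    hits = sum(leaf_of(evaluator, rows[i])["label"] == labels[i] for i in holdout)
    return hits / len(holdout)

# Synthetic demonstration data: two numeric features, one binary label.
rng = random.Random(0)
rows = [(rng.random(), rng.random()) for _ in range(600)]
labels = [int(x > 0.5 and y > 0.3) for x, y in rows]
holdout = list(range(500, 600))  # records never offered to the sampler
chosen = tree_guided_sample(rows[:500], labels[:500], n_pilot=100, n_sample=60, rng=rng)
score = evaluate_sample(rows, labels, chosen, holdout)
print(f"selected {len(chosen)} records, second-tree holdout accuracy = {score:.2f}")
```

In this sketch the held-out accuracy of the second tree serves as the quality signal for the sample; a low score would suggest redrawing or enlarging the sample. Whether the paper's method uses the same signal is not stated in the abstract.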

CHEN Li, FEI Hongxiao, DING Hailun, CHENG Lin, ZHAI Jiyu. A data sampling method based on double decision tree [J]. Computer Engineering & Science.

