• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (03): 525-533.

• Artificial Intelligence and Data Mining •

A cost-sensitive imbalanced data classification algorithm based on KPCA-Stacking

CAO Ting-ting, ZHANG Zhong-lin

  1. (College of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China)
  • Received: 2019-12-31 Revised: 2020-04-27 Accepted: 2021-03-25 Online: 2021-03-25 Published: 2021-03-29
  • Supported by:
    National Natural Science Foundation of China (61662043)

Abstract: Cost-sensitive learning is an important strategy for classifying imbalanced data, and nonlinearity in the data features makes classification harder still. To address both problems, this paper combines cost-sensitive learning with kernel principal component analysis (KPCA) and proposes a cost-sensitive Stacking ensemble algorithm called KPCA-Stacking. First, the original dataset is oversampled with the adaptive synthetic sampling method (ADASYN) and then reduced in dimensionality with KPCA. Second, KNN, LDA, SVM, and RF are converted into cost-sensitive algorithms according to the Bayesian risk minimization principle and used as the base learners of the Stacking ensemble framework, with logistic regression as the meta-learner. Comparative experiments against 10 algorithms, including the J48 decision tree, on 5 public datasets show that the cost-sensitive KPCA-Stacking algorithm improves the recognition rate of the minority class to some extent and achieves better overall classification performance than a single model.


Key words: imbalanced data, cost-sensitive, KPCA, Stacking, ADASYN oversampling, classification
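
The abstract states that the base learners are made cost-sensitive through Bayesian risk minimization, but it does not give the cost values or the implementation. As a minimal sketch of that decision rule only, the example below picks the class with the lowest expected misclassification cost from a classifier's posterior probabilities. The function name `min_risk_class` and the cost matrix `COST` are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def min_risk_class(posteriors: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> int:
    """Return the class index with the lowest expected misclassification cost.

    posteriors[j] is an estimate of P(class j | x) from any probabilistic
    classifier. cost[i][j] is the cost of predicting class i when the true
    class is j.
    """
    # Expected cost (conditional risk) of each possible prediction i.
    risks = [
        sum(c_ij * p_j for c_ij, p_j in zip(row, posteriors))
        for row in cost
    ]
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical costs (assumption): class 1 is the minority class, and missing
# it is five times as costly as a false alarm.
COST = [
    [0.0, 5.0],  # predict 0: no cost if true class is 0, cost 5 if it is 1
    [1.0, 0.0],  # predict 1: cost 1 if true class is 0, no cost if it is 1
]

# risk(0) = 5 * 0.3 = 1.5 and risk(1) = 1 * 0.7 = 0.7, so class 1 is chosen
# even though its posterior is below 0.5.
print(min_risk_class([0.7, 0.3], COST))  # prints 1

# risk(0) = 5 * 0.1 = 0.5 and risk(1) = 1 * 0.9 = 0.9, so class 0 is chosen.
print(min_risk_class([0.9, 0.1], COST))  # prints 0
```

With these assumed costs the rule predicts the minority class whenever its posterior exceeds 1/6, which is how a cost matrix shifts the decision boundary toward the minority class without retraining the underlying classifier.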