• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (10): 1856-1863.

• Artificial Intelligence and Data Mining •

基于DPC聚类重采样结合ELM的不平衡数据分类算法

董宏成1,2,文志云1,2,3,万玉辉1,2 ,晏飞扬1,2    

  1. (1.重庆邮电大学通信与信息工程学院,重庆 400065;2.重庆邮电大学通信新技术应用研究中心,重庆 400065;

    3.重庆信科设计有限公司,重庆 401121)
  • 收稿日期:2020-06-24 修回日期:2020-09-03 接受日期:2021-10-25 出版日期:2021-10-25 发布日期:2021-10-22
  • 作者简介:董宏成 (1969),男,湖北安陆人,博士,高级工程师,研究方向为大数据和数据分析。

An imbalanced data classification algorithm based on DPC clustering resampling combined with ELM

DONG Hong-cheng1,2, WEN Zhi-yun1,2,3, WAN Yu-hui1,2, YAN Fei-yang1,2

  (1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065;

    2. Research Center of New Telecommunication Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065;

    3. Chongqing Information Technology Designing Co., Ltd., Chongqing 401121, China)

  • Received: 2020-06-24 Revised: 2020-09-03 Accepted: 2021-10-25 Online: 2021-10-25 Published: 2021-10-22
  • About author: DONG Hong-cheng, born in 1969, a native of Anlu, Hubei, PhD, senior engineer; his research interests include big data and data analysis.

摘要: 采样技术与ELM分类算法进行结合可提高少数类样本的分类精度,但现有的大多数结合ELM的采样方法并未考虑到样本的不平衡程度及样本内部的分布情况,采样技术过于单一,导致分类模型的效率低下,少数类样本的识别率不高。针对此问题,提出了一种基于DPC聚类的重采样技术结合ELM的不平衡数据分类算法,首先根据数据集的不平衡程度分2种情况构建一个混合采样模型来平衡数据集;然后在此模型上运用DPC聚类算法分别对多数类样本和少数类样本进行分析处理,解决数据中存在的类内不平衡和噪声问题,使得2类样本相对均衡;最后使用ELM分类算法对得到的数据集进行分类。实验结果表明,与同类型分类算法进行比较,所提算法的2个分类性能指标在实验数据集上都有明显提升。

关键词: 极限学习机, 不平衡数据分类, DPC聚类, 重采样

Abstract: Combining sampling techniques with the extreme learning machine (ELM) classification algorithm can improve the classification accuracy of minority-class samples. However, most existing sampling methods combined with ELM consider neither the degree of imbalance of the samples nor the distribution within each class, and they rely on a single sampling technique, which leads to an inefficient classification model and a low recognition rate for minority-class samples. To address this problem, this paper proposes an imbalanced data classification algorithm based on DPC clustering resampling combined with ELM. First, a hybrid sampling model is constructed to balance the data set, with two cases distinguished according to the data set's degree of imbalance. Second, within this model, the DPC clustering algorithm is applied to the majority class and the minority class separately, addressing within-class imbalance and noise in the data so that the two classes become relatively balanced. Finally, the resulting balanced data set is classified with the ELM classification algorithm. Experimental results show that, compared with classification algorithms of the same type, the proposed algorithm clearly improves both classification performance measures, F-Measure and G-mean, on the experimental data sets.


Key words: extreme learning machine, imbalanced data classification, DPC clustering, resampling
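
The final classification and evaluation stages named in the abstract can be illustrated with a minimal sketch. It assumes a basic single-hidden-layer ELM with a sigmoid activation and a pseudoinverse solution for the output weights, and it takes F-Measure to be the balanced F1 score with label 1 as the minority class. The class name `ELM`, the helper `f_measure_and_g_mean`, and all parameter values are hypothetical choices for illustration, not taken from the paper. The DPC-based resampling stage is not shown, because the abstract does not give its details.

```python
import numpy as np


class ELM:
    """Basic extreme learning machine for binary labels {0, 1} (illustrative sketch)."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the random hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        # Input weights and biases are drawn once at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Encode targets as +1 / -1 and solve for the output weights in the
        # least-squares sense using the Moore-Penrose pseudoinverse.
        T = np.where(y == 1, 1.0, -1.0)
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        # Positive output score -> class 1, otherwise class 0.
        return (self._hidden(X) @ self.beta > 0).astype(int)


def f_measure_and_g_mean(y_true, y_pred):
    """Return (F-Measure, G-mean), treating label 1 as the minority (positive) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # F-Measure here is F1: the harmonic mean of precision and recall.
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # G-mean: geometric mean of minority recall and majority recall (specificity).
    g = float(np.sqrt(recall * specificity))
    return float(f), g


if __name__ == "__main__":
    # Synthetic imbalanced data: 200 majority samples, 20 minority samples.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.5, 1.0, (20, 2))])
    y = np.array([0] * 200 + [1] * 20)
    model = ELM(n_hidden=30).fit(X, y)
    print(f_measure_and_g_mean(y, model.predict(X)))
```

Both measures depend on minority-class recall, so a classifier that labels every sample as the majority class scores zero on each even though its plain accuracy would be high. This is why they are preferred over accuracy for imbalanced data.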