• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (05): 930-936.

• Articles •

Research on software defect prediction based on integrated sampling and ensemble learning

DAI Xiang1, MAO Yuguang1,2

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
  • Received: 2014-04-10 Revised: 2014-05-26 Online: 2015-05-25 Published: 2015-05-25
  • Supported by:

    National Natural Science Foundation of China (41301407)

Abstract:

We study the class imbalance problem in software defect prediction and propose a sampling method for imbalanced data, addressing the drop in classifier performance caused by the skewed class distribution of the sample set. To avoid the blindness of random sampling, we balance the dataset with a heuristic integrated (hybrid) sampling method: SMOTE over-sampling for the minority class and KMeans-clustering-based down-sampling for the majority class. After obtaining a balanced dataset, we combine multiple single classifiers by voting to produce the ensemble prediction. Experimental results show that the software defect prediction method combining integrated sampling with ensemble learning classifies well, obtaining a higher recall (true positive rate) while significantly reducing the false alarm rate.

Key words: imbalanced data; SMOTE; KMeans; voting; ensemble learning
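
The pipeline summarized in the abstract (SMOTE for the minority class, KMeans-based down-sampling for the majority class, then a voting ensemble) can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' implementation. The abstract does not say how samples are chosen after clustering, so keeping the real sample nearest each centroid is an assumption. The function names, the toy 2-D data, and the three stand-in base classifiers (1-NN, 3-NN, nearest centroid) are all hypothetical.

```python
import math
import random
from collections import Counter

def smote(minority, n_new, k=3, rng=random):
    """Basic SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours to create n_new synthetic samples."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def kmeans_undersample(majority, n_keep, iters=20, rng=random):
    """Cluster the majority class into n_keep clusters (Lloyd's algorithm)
    and return, for each centroid, the real sample nearest to it.
    ASSUMPTION: the selection rule is illustrative, not from the paper."""
    centroids = rng.sample(majority, n_keep)
    for _ in range(iters):
        clusters = [[] for _ in range(n_keep)]
        for p in majority:
            idx = min(range(n_keep), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return [min(majority, key=lambda p: math.dist(p, c)) for c in centroids]

def balance(majority, minority, target, rng):
    """Bring both classes to `target` samples; label 0 = clean, 1 = defective."""
    minority_up = minority + smote(minority, target - len(minority), rng=rng)
    majority_down = kmeans_undersample(majority, target, rng=rng)
    return [(p, 0) for p in majority_down] + [(p, 1) for p in minority_up]

def make_knn(data, k):
    """Hypothetical base classifier: k-nearest-neighbour vote."""
    def predict(x):
        nearest = sorted(data, key=lambda s: math.dist(x, s[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return predict

def make_nearest_centroid(data):
    """Hypothetical base classifier: label of the closest class centroid."""
    cents = {}
    for label in {l for _, l in data}:
        pts = [p for p, l in data if l == label]
        cents[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    def predict(x):
        return min(cents, key=lambda l: math.dist(x, cents[l]))
    return predict

def majority_vote(classifiers, x):
    """Ensemble prediction: the label chosen by most base classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy imbalanced data: 40 clean modules vs. 6 defective ones (made up).
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
minority = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(6)]

train = balance(majority, minority, target=20, rng=rng)
ensemble = [make_knn(train, 1), make_knn(train, 3), make_nearest_centroid(train)]
print(majority_vote(ensemble, (5.0, 5.0)), majority_vote(ensemble, (0.0, 0.0)))
```

An odd number of base classifiers avoids ties in a two-class vote. In practice one would more likely reach for library implementations of SMOTE, KMeans, and voting than hand-rolled versions like these.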