• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Software Engineering

Software defect prediction based on cost-sensitive support vector machine

REN Shengbing, LIAO Xiangdang

  1. (School of Software, Central South University, Changsha 410075, China)
  • Received: 2017-11-08 Revised: 2018-01-24 Online: 2018-10-25 Published: 2018-10-25

Abstract:

Software defect prediction is a typical imbalanced learning problem. We propose CCSSVM, a software defect prediction model that improves the cost-sensitive support vector machine (CSSVM) algorithm with a clustering algorithm. CCSSVM combines the SVM with class misclassification costs and optimizes the misclassification cost factors, using an evaluation metric for imbalanced data as the objective function, to raise the recognition rate of minority-class samples. Clustering locates the center point of each class, and the class confidence of each sample is defined from its distance to the center of its own class, so that each sample receives its own misclassification cost coefficient. Introducing this confidence into the cost-sensitive SVM optimization problem improves the robustness of the algorithm and the classification performance of the SVM. To improve the generalization ability of the model, we use a genetic algorithm to optimize feature selection and the model parameters. Experiments on the NASA Metrics Data Program (MDP) datasets show that our method clearly improves the G-mean and F-measure scores.
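The abstract gives no formulas, so the following is only an illustrative sketch of two ideas it names: per-sample misclassification costs derived from the distance to the class center, and the G-mean and F-measure metrics. Several details are assumptions made for illustration and are not taken from the paper: the class mean stands in for the clustering step, confidence decays linearly with distance, and the `floor` value, function names, and cost values are hypothetical.

```python
import math


def class_centers(X, y):
    """Mean vector of each class (assumed stand-in for a cluster center)."""
    centers = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centers[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centers


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def sample_costs(X, y, class_cost, floor=0.1):
    """Per-sample cost = class cost * class confidence.

    Confidence is 1 at the class center and falls linearly to `floor`
    (a hypothetical lower bound) for the sample farthest from its center.
    """
    centers = class_centers(X, y)
    dist = [euclidean(x, centers[t]) for x, t in zip(X, y)]
    max_dist = {c: max(d for d, t in zip(dist, y) if t == c) for c in centers}
    costs = []
    for d, t in zip(dist, y):
        if max_dist[t] == 0:
            conf = 1.0  # every sample of this class sits on the center
        else:
            conf = 1.0 - (1.0 - floor) * d / max_dist[t]
        costs.append(class_cost[t] * conf)
    return costs


def g_mean_f_measure(y_true, y_pred, positive=1):
    """G-mean = sqrt(recall * specificity); F-measure = harmonic mean of
    precision and recall, both for the `positive` (defective) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall * specificity)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return g_mean, f_measure


# Toy data: class 1 (defective) is the minority and gets a higher class cost.
X_demo = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
y_demo = [0, 0, 0, 1, 1]
demo_costs = sample_costs(X_demo, y_demo, class_cost={0: 1.0, 1: 3.0})
```

In a setup like this, the resulting costs could be handed to an SVM implementation that accepts per-sample weights (for example, the `sample_weight` argument of `fit` in scikit-learn's `SVC`), and the G-mean or F-measure could serve as the score that a search over the class costs tries to maximize.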
 

Key words: software defect prediction, cost sensitivity, support vector machine, imbalanced data classification, parameter selection, genetic algorithm