J4 ›› 2010, Vol. 32 ›› Issue (7): 95-98. doi: 10.3969/j.issn.1007130X.2010.

A Bayesian Classification Algorithm for Imbalanced Data

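WANG Chunliang1,2, FU Yuchen2

  1. (1. The Second Affiliated Hospital of Suzhou University, Suzhou, Jiangsu 215004; 2. School of Computer Science and Technology, Suzhou University, Suzhou, Jiangsu 215006)
  • Received: 2009-03-13 Revised: 2009-08-26 Online: 2010-06-25 Published: 2010-06-25
  • Corresponding author: WANG Chunliang E-mail: c.l.wang2008@163.com
  • About the authors: WANG Chunliang (born 1979), male, from Tongling, Anhui, master's student; research interests include network information technology and software design and applications. FU Yuchen, associate professor; research interests include management information systems, e-government and e-commerce, data mining and business intelligence, and geographic information systems.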

A New Bayesian Classification Algorithm for Non-Balance Datasets

WANG Chunliang1,2,FU Yuchen2   

  1. (1.No.2 Hospital Affiliated to Suzhou University,Suzhou 215004;
    (2.School of Computer Science and Technology,Suzhou University,Suzhou 215006,China)
  • Received:2009-03-13 Revised:2009-08-26 Online:2010-06-25 Published:2010-06-25
  • Contact: WANG Chunliang E-mail:c.l.wang2008@163.com

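Abstract (translated from the Chinese):

Drawing on the idea of semi-supervised classification, this paper proposes a Bayesian classification model based on an improved EM algorithm to classify imbalanced data with a large number of randomly missing values in mobile communication networks. First, prior probabilities that reflect the states of the variables to some extent are obtained from a preliminary statistical analysis of the actual data. These are used as the initial values of the Bayesian classification model for iterative EM training, which reduces the number of EM iterations and mitigates the EM algorithm's sensitivity to initial values and its tendency to converge to local optima. Then, the Bayesian network classification model obtained by training on historical mobile communication data is used to classify the test data. Experimental results show that the method greatly improves the prediction success rate for negative-class samples in mobile communication data and performs better than traditional statistical analysis methods.

Keywords: semi-supervised learning, Bayesian network, EM algorithm, imbalanced data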

Abstract:

Based on the idea of semi-supervised learning, a new Bayesian classifier model using an improved EM (Expectation-Maximization) algorithm is proposed to classify and predict imbalanced data gathered from mobile communication networks. First, a statistical analysis of the actual data is performed to calculate prior probabilities. Using these prior probabilities as the initial values of the Bayesian model speeds up the convergence of the EM algorithm. Second, a classifier based on the Bayesian network is trained with the improved EM steps to learn the category characteristics of the historical communication data. Third, this classifier is used to predict the label of the current data sample. The experimental results demonstrate that the proposed method greatly improves the prediction accuracy for the negative class and performs better than traditional statistical methods.

Key words: semi-supervised learning; Bayes
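
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: EM training of a Bayesian classifier whose starting parameters come from simple statistics on the available data rather than from random values. It makes several assumptions that are not in the paper: a categorical naive Bayes model stands in for the Bayesian network, only class labels are treated as missing, and the function names (`init_from_labeled`, `posterior`, `m_step`, `fit_em`) and the toy data are hypothetical.

```python
import numpy as np

def init_from_labeled(X, y, n_classes, n_values, alpha=1.0):
    """Initial parameters from Laplace-smoothed counts on the labeled rows."""
    prior = np.bincount(y, minlength=n_classes) + alpha
    prior = prior / prior.sum()
    cond = np.full((n_classes, X.shape[1], n_values), alpha)
    for row, c in zip(X, y):
        cond[c, np.arange(X.shape[1]), row] += 1.0
    cond /= cond.sum(axis=2, keepdims=True)
    return prior, cond

def posterior(X, prior, cond):
    """E-step: P(class | row) under the naive Bayes assumption."""
    log_p = np.tile(np.log(prior), (X.shape[0], 1))
    for j in range(X.shape[1]):
        # cond[:, j, :] is (classes, values); pick the observed value per row
        log_p += np.log(cond[:, j, :][:, X[:, j]]).T
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def m_step(X, resp, n_values, alpha=1.0):
    """M-step: re-estimate parameters from (soft) class responsibilities."""
    n_classes = resp.shape[1]
    prior = resp.sum(axis=0) + alpha
    prior /= prior.sum()
    cond = np.full((n_classes, X.shape[1], n_values), alpha)
    for j in range(X.shape[1]):
        for v in range(n_values):
            cond[:, j, v] += resp[X[:, j] == v].sum(axis=0)
    cond /= cond.sum(axis=2, keepdims=True)
    return prior, cond

def fit_em(X_lab, y_lab, X_unl, n_classes=2, n_values=2, n_iter=50, tol=1e-6):
    """EM over labeled and unlabeled rows, started from labeled-data statistics."""
    prior, cond = init_from_labeled(X_lab, y_lab, n_classes, n_values)
    X_all = np.vstack([X_lab, X_unl])
    hard = np.eye(n_classes)[y_lab]  # labeled rows keep their known class
    for _ in range(n_iter):
        resp = np.vstack([hard, posterior(X_unl, prior, cond)])
        new_prior, new_cond = m_step(X_all, resp, n_values)
        delta = np.abs(new_cond - cond).max()
        prior, cond = new_prior, new_cond
        if delta < tol:
            break
    return prior, cond

# Toy imbalanced data (invented): class 0 is the majority, class 1 the minority.
X_lab = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0], [1, 1, 1], [1, 1, 0]])
y_lab = np.array([0, 0, 0, 0, 1, 1])
X_unl = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0]])

prior, cond = fit_em(X_lab, y_lab, X_unl)
pred = posterior(np.array([[0, 0, 0], [1, 1, 1]]), prior, cond).argmax(axis=1)
```

Starting EM from data-derived estimates instead of random values is what the abstract credits with reducing the iteration count and the sensitivity to initialization. The sketch shows only that initialization pattern, not the authors' specific improvements.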