• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Graphics and Images

Image Spam Filtering Method Based on Ensemble Learning

赵俊生,候圣,王鑫宇,尹玉洁   

  1. (College of Information Engineering, Inner Mongolia University of Technology, Hohhot, Inner Mongolia 010080)
  • Received: 2019-08-10 Revised: 2020-01-10 Published: 2020-06-25 Online: 2020-06-25
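  • Funding:

    National Natural Science Foundation of China (61363052); Natural Science Foundation of Inner Mongolia Autonomous Region (2015MS0614); Key Natural Science Fund of Inner Mongolia University of Technology (ZD201416)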

An image spam filtering method based on integrated learning

ZHAO Jun-sheng, HOU Sheng, WANG Xin-yu, YIN Yu-jie

  1. (College of Information Engineering, Inner Mongolia University of Technology, Hohhot 010080, China)
  • Received: 2019-08-10 Revised: 2020-01-10 Online: 2020-06-25 Published: 2020-06-25

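Abstract (Chinese version):

Most current image spam filtering techniques use internationally available spam image datasets as their training sets. These datasets do not match the characteristics of image spam circulating in China, are not updated in real time, and are typically paired with a single classifier, so filtering performance is hard to guarantee. To address this problem, a database of domestic spam images is first built. Color, texture, and shape features are then extracted from the images, and the K-NN classification algorithm is used to select the HSV color histogram as the preferred feature for training, testing, and comparing different classifiers. Three base classifiers, namely a rough set-based K-NN algorithm, the Naive Bayes algorithm, and the SVM algorithm, are then combined through serial iterative boosting to form a strong ensemble learning classifier. The method filters domestic image spam effectively and improves accuracy and recall at the same time, to 97.3% and 96.1% respectively, while reducing the false positive rate to 2.7%.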

Key words (Chinese version): image spam filtering, image classification, ensemble learning, K-NN algorithm, HSV color histogram
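
To make the HSV color histogram feature named in the abstract concrete, the following is a minimal Python sketch that uses only the standard library. The function name `hsv_histogram`, the 8×3×3 bin layout, and the plain pixel-list input are illustrative assumptions; the abstract does not state the quantization the authors actually used.

```python
import colorsys


def hsv_histogram(pixels, bins=(8, 3, 3)):
    """Return a normalized, flattened HSV histogram.

    pixels: iterable of (r, g, b) tuples with channel values in 0..255.
    bins:   number of bins for hue, saturation, and value (assumed layout).
    """
    h_bins, s_bins, v_bins = bins
    hist = [0.0] * (h_bins * s_bins * v_bins)
    count = 0
    for r, g, b in pixels:
        # colorsys works on floats in [0, 1] and returns h, s, v in [0, 1].
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a value of exactly 1.0 falls into the last bin.
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1.0
        count += 1
    if count == 0:
        return hist
    # Normalize so that images of different sizes are comparable.
    return [x / count for x in hist]


# Toy input: one pure red pixel and one black pixel.
feature = hsv_histogram([(255, 0, 0), (0, 0, 0)])
```

The resulting vector has one entry per (hue, saturation, value) bin and sums to 1, so it can be passed directly to a distance-based classifier such as K-NN.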

Abstract:

Currently, the majority of image spam filtering technologies adopt an internationally common image spam dataset as the training set. Such a dataset lacks real-time updates and exhibits characteristics different from those of Chinese domestic image spam. In addition, these technologies typically employ only one type of classifier, which makes filtering performance hard to guarantee. To address this issue, a domestic image spam database is constructed, and the color, texture, and shape features of the images are first extracted. The K-NN classification algorithm is then used to select the HSV color histogram feature, which serves for training, testing, and performance comparison of different classifiers. Three base classifiers, namely rough set-based K-NN, Naive Bayes, and SVM, are combined by a serial iterative boosting method to form a strong integrated (ensemble) learning classifier, which can effectively filter domestic image spam. The accuracy and recall of image spam filtering are improved simultaneously, to 97.3% and 96.1% respectively, and the false positive rate is reduced to 2.7%.
 

Key words: image spam filtering, image classification, integrated learning, K-NN algorithm, HSV color histogram
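
The serial, boosting-style combination of three base classifiers described in the abstract can be sketched as follows. This is an AdaBoost-style weighting scheme chosen for illustration and is not necessarily the authors' exact procedure. The names `boost_combine` and `ensemble_predict` are hypothetical, and the three stub classifiers are toy stand-ins for trained rough set-based K-NN, Naive Bayes, and SVM models.

```python
import math


def boost_combine(classifiers, samples, labels, eps=1e-10):
    """Assign a vote weight to each pre-trained classifier, one after another.

    classifiers: list of callables x -> +1 (spam) or -1 (ham).
    Returns one weight (alpha) per classifier, AdaBoost style (assumption).
    """
    n = len(samples)
    weights = [1.0 / n] * n
    alphas = []
    for clf in classifiers:
        preds = [clf(x) for x in samples]
        # Weighted error of this classifier under the current sample weights.
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        err = min(max(err, eps), 1.0 - eps)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1.0 - err) / err)
        alphas.append(alpha)
        # Raise the weight of misclassified samples for the next classifier.
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return alphas


def ensemble_predict(classifiers, alphas, x):
    """Weighted vote of the base classifiers: +1 (spam) or -1 (ham)."""
    score = sum(a * clf(x) for a, clf in zip(alphas, classifiers))
    return 1 if score >= 0 else -1


# Toy data (hypothetical): sample i has label labels[i].
labels = [1, 1, -1, -1, 1, -1]
samples = list(range(len(labels)))


def make_stub(wrong_index):
    """Stub classifier that is wrong on exactly one sample."""
    def clf(x):
        return -labels[x] if x == wrong_index else labels[x]
    return clf


stubs = [make_stub(0), make_stub(1), make_stub(2)]
alphas = boost_combine(stubs, samples, labels)
predictions = [ensemble_predict(stubs, alphas, x) for x in samples]
```

On this toy data each stub misclassifies one distinct sample, and the weighted vote recovers all six labels. Real base classifiers would be trained on the HSV histogram features and evaluated on held-out mail.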