• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学) ›› 2022, Vol. 44 ›› Issue (07): 1207-1215.

• Computer Network and Information Security •

A network traffic classification method based on clustering and noise

PANG Xing-long (庞兴龙), ZHU Guo-sheng (朱国胜), YANG Shao-long (杨少龙), LI Xiu-yuan (李修远)

  1. (School of Computer and Information Engineering, Hubei University, Wuhan 430062, China)
  • Received: 2021-10-12 Revised: 2021-12-14 Accepted: 2022-07-25 Online: 2022-07-25 Published: 2022-07-25

Abstract: Labeling real-world network traffic data inevitably introduces labeling errors, so the labeled data are contaminated by noise; that is, a sample's observed label may differ from its true label. To reduce the negative impact of noisy labels on classification accuracy, two sources of label noise are considered: a sample marked with a valid but incorrect label type, and a label type that is misspelled. A network traffic classification method based on label noise correction is proposed, which uses clustering and weight division to evaluate and repair the observed samples. Experiments on two network traffic datasets show that, compared with three label noise repair algorithms (STC, CC and ADE), the proposed repair algorithm improves the final classification results under different noise ratios. On the NSL-KDD dataset, the average label repair rate is about 23.00%, 7.58% and 2.05% higher than that of STC, CC and ADE, respectively; on the MOORE dataset, it is about 35.12%, 10.40% and 4.71% higher, respectively. The final classification model also shows good classification stability.

Key words: noisy label, network traffic classification, K-means clustering, label repair
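
The abstract names two kinds of label noise (a misspelled label type and a valid but wrong label type) and a repair step built on clustering. The sketch below illustrates that general idea only and is not the authors' algorithm. The label vocabulary, the `fix_spelling` and `repair_labels` helpers, the toy data, and the `min_share` majority threshold (a stand-in for the paper's weight division, whose details the abstract does not give) are all assumptions made for illustration.

```python
import difflib
from collections import Counter

# Hypothetical label vocabulary; the real datasets define their own classes.
KNOWN_LABELS = ["normal", "dos", "probe"]

def fix_spelling(label, vocabulary=KNOWN_LABELS, cutoff=0.6):
    """Map a misspelled label to the closest known label, else keep it."""
    match = difflib.get_close_matches(label.lower(), vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else label

def kmeans(points, k, iters=20):
    """Tiny deterministic K-means: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def repair_labels(points, labels, k, min_share=0.6):
    """Fix spelling, then relabel each cluster to its majority label
    when that label covers at least min_share of the cluster (assumed rule)."""
    labels = [fix_spelling(label) for label in labels]
    assign = kmeans(points, k)
    repaired = list(labels)
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if not idx:
            continue
        top, count = Counter(labels[i] for i in idx).most_common(1)[0]
        if count / len(idx) >= min_share:
            for i in idx:
                repaired[i] = top
    return repaired

# Toy data: two well-separated groups, one misspelled and one wrong label.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (0.1, 0.3),
          (5.2, 4.9), (4.8, 5.0), (0.3, 0.2), (5.1, 5.3)]
noisy = ["normal", "dos", "normal", "dos", "dos", "doss", "normal", "dos"]
repaired = repair_labels(points, noisy, k=2)
print(repaired)
```

In this toy run the misspelled `doss` is normalized first, and the fourth sample, which sits in the cluster dominated by `normal`, is relabeled by the majority rule. A threshold below 1.0 lets a cluster with a few noisy labels still be repaired, while clusters without a clear majority are left untouched.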