• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (04): 654-664.

• Computer Networks and Information Security •

A mobile proxy application traffic identification method based on machine learning

CUI Hong1, ZHAO Shuang2, ZHANG Guang-sheng3, SU Jin-shu2

  (1. FiberHome Telecommunication Technologies Co., Ltd., Wuhan 430074, China;
    2. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
    3. Investigation Technology Center of PLA, Beijing 100080, China)
  • Received: 2020-08-16 Revised: 2020-12-14 Accepted: 2022-04-25 Online: 2022-04-25 Published: 2022-04-20

Abstract: With the rapid development of mobile networks, more and more users rely on proxy applications to protect their privacy, hide their online behavior, or bypass network restrictions, which poses new challenges for network management and auditing. Malicious attackers can also use proxy applications to hide their identity, making malicious behavior harder to detect and prevent. Proxy application traffic identification therefore plays an important role in network management and security, yet it has not been fully studied. Because proxy application traffic is usually encrypted or obfuscated, traditional traffic identification techniques cannot be applied effectively. To identify mobile proxy application traffic accurately and quickly, a set of payload-independent side-channel traffic features is proposed, and the option field of the TCP header is used for the first time to characterize traffic. Classifiers trained with four machine learning algorithms are evaluated on two identification targets to validate the effectiveness of the proposed features, and the importance of each feature category is analyzed. The experimental results show that the proposed features identify proxy application traffic effectively. When identifying whether traffic is forwarded through a proxy, the random-forest-based classifier achieves an overall accuracy above 99%. When identifying which proxy application the traffic belongs to, the overall accuracy is higher than 94%. On the public dataset ISCX VPN-nonVPN, the proposed method is more accurate and classifies faster than other methods, which makes it suitable for real-time traffic identification.

Key words: proxy application traffic identification, mobile application, machine learning, traffic feature, decision tree
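
As an illustration of what a payload-independent feature drawn from the TCP option field might look like, the following minimal sketch parses the option bytes of a single TCP header into a small feature dictionary. It is not the authors' feature set, which the abstract does not enumerate: the function names `parse_tcp_options` and `build_example_syn_header`, the choice of features, and the example header are all assumptions made for illustration. Only the option kinds and lengths follow the TCP specification.

```python
import struct

# Option kinds defined by the TCP specification.
KIND_EOL = 0         # end of option list
KIND_NOP = 1         # no-operation (padding)
KIND_MSS = 2         # maximum segment size
KIND_WSCALE = 3      # window scale
KIND_SACK_PERM = 4   # SACK permitted
KIND_TIMESTAMP = 8   # timestamps


def parse_tcp_options(tcp_header: bytes) -> dict:
    """Extract hypothetical option-based features from one raw TCP header."""
    if len(tcp_header) < 20:
        raise ValueError("TCP header must be at least 20 bytes")
    # The upper 4 bits of byte 12 give the header length in 32-bit words.
    data_offset = (tcp_header[12] >> 4) * 4
    if data_offset < 20 or data_offset > len(tcp_header):
        raise ValueError("invalid data offset")
    options = tcp_header[20:data_offset]

    features = {
        "kinds": [],              # option kinds in the order they appear
        "mss": None,
        "window_scale": None,
        "sack_permitted": False,
        "timestamps": False,
        "options_len": len(options),
    }
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == KIND_EOL:
            break
        if kind == KIND_NOP:      # single-byte option, no length field
            features["kinds"].append(kind)
            i += 1
            continue
        if i + 1 >= len(options):
            break                 # truncated option, stop parsing
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                 # malformed length, stop parsing
        value = options[i + 2:i + length]
        features["kinds"].append(kind)
        if kind == KIND_MSS and length == 4:
            features["mss"] = struct.unpack("!H", value)[0]
        elif kind == KIND_WSCALE and length == 3:
            features["window_scale"] = value[0]
        elif kind == KIND_SACK_PERM:
            features["sack_permitted"] = True
        elif kind == KIND_TIMESTAMP and length == 10:
            features["timestamps"] = True
        i += length
    return features


def build_example_syn_header() -> bytes:
    """Build a made-up 40-byte SYN header (20 fixed bytes + 20 option bytes)."""
    fixed = struct.pack(
        "!HHIIBBHHH",
        40000, 443,   # source and destination ports (arbitrary)
        0, 0,         # sequence and acknowledgment numbers
        0xA0,         # data offset = 10 words (40 bytes)
        0x02,         # SYN flag
        65535, 0, 0,  # window, checksum (not computed), urgent pointer
    )
    options = bytes([
        2, 4, 0x05, 0xB4,               # MSS = 1460
        4, 2,                           # SACK permitted
        8, 10, 0, 0, 0, 1, 0, 0, 0, 0,  # timestamps
        1,                              # NOP padding
        3, 3, 7,                        # window scale = 7
    ])
    return fixed + options


if __name__ == "__main__":
    print(parse_tcp_options(build_example_syn_header()))
```

Features of this kind depend only on header bytes, so they remain available when the payload is encrypted or obfuscated. In a pipeline like the one the abstract describes, such values would be combined with other flow-level features and passed to a classifier such as a random forest.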