
Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (08): 1433-1442.

• Computer Network and Information Security •

An Android malware detection method based on a pre-trained language model

YIN Jie1, HUANG Xiao-yu1, LIU Jia-yin1, NIU Bo-wei2, XIE Wen-wei3,4

  (1. Department of Computer Information and Network Security, Jiangsu Police Institute, Nanjing 210031;
    2. Cyber Security Guard Corps, Jiangsu Provincial Public Security Department, Nanjing 210024;
    3. Department of Network Security, Trend Micro Incorporated, Nanjing 210012;
    4. FOCUSLAB of Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
  • Received: 2022-11-01  Revised: 2023-01-06  Accepted: 2023-08-25  Online: 2023-08-25  Published: 2023-08-18
  • Funding:
    National Natural Science Foundation of China (62272203); Open Project of the State Key Laboratory of CAD&CG, Zhejiang University (A2102); Open Fund of the State Key Laboratory for Novel Software Technology, Nanjing University (KFKT2020B19); Natural Science Foundation of the Jiangsu Higher Education Institutions (21KJD520003)

Abstract: In recent years, Android malware detection methods based on supervised machine learning have made some progress. However, because malware samples are difficult to collect, labeled datasets are generally small, which limits the generalization ability of the trained supervised models. To address this problem, a malware detection method that combines unsupervised and supervised learning is proposed. First, a language model is pre-trained with an unsupervised method on a large number of unlabeled APK samples to learn the rich and complex semantic relationships between operators in the bytecode, which improves the model's generalization ability. Then, the pre-trained language model is fine-tuned on labeled malware samples so that it detects malware more effectively. Experiments on Drebin and other datasets show that, compared with the baseline methods, the proposed method generalizes better and achieves better detection performance, with a maximum detection accuracy of 98.7%.
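The two-stage data flow described in the abstract can be illustrated with a minimal sketch. The abstract does not state the pre-training objective or how bytecode is tokenized, so the sketch assumes a BERT-style masked-token objective over Dalvik opcode sequences. The function names, the `[MASK]` token, the sample sequences, and the labels are all hypothetical, and the language model itself is omitted.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def build_vocab(sequences):
    """Map every opcode seen in the unlabeled corpus to an integer id."""
    vocab = {MASK: 0}
    for seq in sequences:
        for op in seq:
            vocab.setdefault(op, len(vocab))
    return vocab


def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Hide a random subset of opcodes for masked-token pre-training.

    Returns (masked_seq, targets): targets[i] holds the original opcode
    at each masked position and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    masked, targets = [], []
    for op in seq:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(op)
        else:
            masked.append(op)
            targets.append(None)
    return masked, targets


# Stage 1 input: opcode sequences standing in for bytecode from unlabeled APKs.
unlabeled = [
    ["const-string", "invoke-virtual", "move-result-object", "return-object"],
    ["new-instance", "invoke-direct", "invoke-virtual", "return-void"],
]
vocab = build_vocab(unlabeled)
rng = random.Random(0)  # seeded so the masking is repeatable
pretrain_examples = [mask_sequence(s, mask_prob=0.3, rng=rng) for s in unlabeled]

# Stage 2 input: a smaller labeled set (1 = malware, 0 = benign; labels are illustrative).
labeled = [(unlabeled[0], 0), (unlabeled[1], 1)]
finetune_examples = [([vocab[op] for op in seq], label) for seq, label in labeled]
```

In this sketch, the pre-training examples need no labels at all, which is why a large unlabeled APK corpus can be used in the first stage, while the fine-tuning examples pair each encoded sequence with a malware or benign label.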

Key words: Android, malware detection, pre-trained language model, unsupervised learning