• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (08): 1433-1442.

• Computer Network and Information Security •

An Android malware detection method based on pre-trained language model

YIN Jie1, HUANG Xiao-yu1, LIU Jia-yin1, NIU Bo-wei2, XIE Wen-wei3,4

  1. Department of Computer Information and Network Security, Jiangsu Police Institute, Nanjing 210031;
  2. Cyber Security Guard Corps, Jiangsu Provincial Security Department, Nanjing 210024;
  3. Department of Network Security, Trend Micro Incorporated, Nanjing 210012;
  4. FOCUSLAB of Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Received:2022-11-01 Revised:2023-01-06 Accepted:2023-08-25 Online:2023-08-25 Published:2023-08-18

Abstract: In recent years, Android malware detection methods based on supervised machine learning have made some progress. However, because malware samples are difficult to collect, labeled datasets are generally small, which limits the generalization ability of the trained supervised models. To address this problem, a malware detection method that combines unsupervised and supervised learning is proposed. First, a language model is pre-trained on a large number of unlabeled APK samples using unsupervised methods, so that it learns the rich and complex semantic relationships between different operators. Then, the pre-trained language model is fine-tuned on labeled malware samples to give it the ability to detect malware. Experiments on datasets such as Drebin show that the proposed method has better generalization ability and detection performance than the baseline method, reaching a maximum accuracy of 98.7%.

Key words: Android, malware detection, pre-trained language model, unsupervised learning
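
The unsupervised pre-training step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which pre-training objective the authors use, so the example below assumes a BERT-style masked-token objective; the function name `make_masked_example`, the `[MASK]` symbol, the masking probability, and the sample token sequence are all hypothetical and are not taken from the paper. The sketch only shows how a training pair could be built from one unlabeled token sequence; the model itself and the supervised fine-tuning stage are not shown.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def make_masked_example(tokens, mask_prob=0.15, seed=None):
    """Build one masked-token training pair from an unlabeled sequence.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at every masked position and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Hide this token; the model would be trained to recover it.
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical token sequence standing in for what is extracted from one APK.
sample = ["invoke-virtual", "move-result", "const-string",
          "invoke-static", "return-void"]
masked, labels = make_masked_example(sample, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

No malware labels are needed to build these pairs, which is why this stage can use the large pool of unlabeled APK samples; the labeled samples are needed only for the later fine-tuning stage.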