• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2025, Vol. 47 ›› Issue (01): 86-94.

• Computer Network and Information Security •

Construction and research of malware knowledge graph

LUO Yangxia, LI Hao, WU Chenming

  1. (School of Information, Xi’an University of Finance and Economics, Xi’an 710100, China)
  • Received: 2024-07-08 Revised: 2024-08-27 Accepted: 2025-01-25 Online: 2025-01-25 Published: 2025-01-18
  • Funding:
    National Natural Science Foundation of China (62372373; 61972314); Key Research and Development Program of Shaanxi Province (2024GX-YBXM-545); Xi’an University of Finance and Economics 2023 Graduate Innovation Fund (23YC033); 2024 National Undergraduate Innovation Training Program (202411560029)

Abstract: In recent years, knowledge graphs have been widely applied to malware analysis. Most researchers, however, have focused on constructing malware API knowledge graphs and using them to detect malicious code, and API knowledge graphs offer weak interpretability and demand a high level of expertise. To address these issues, this paper proposes using a named entity recognition (NER) model to extract textual entity information such as malware names and places of discovery, and building a malware knowledge graph from it. The graph is then used to uncover the diversity, evolution paths, threat methods, and classification associations of malware. First, the paper studies how to construct a malware knowledge graph, covering data preprocessing, schema layer construction, and data layer construction. Second, entities in structured and semi-structured malware data are identified and normalized to complete the ontology (entities, relationships, and additional attributes); with the schema layer guiding the data layer, the BERT-BiLSTM-CRF model is used for knowledge extraction. Finally, the Neo4j graph database is used to store and visualize the knowledge graph. The proposed model is validated through simulation on virus database data. The experimental results show that it outperforms comparable models on the reported performance metrics, and the work is significant for making cybersecurity knowledge more accessible and for popularizing knowledge of defense systems.

Key words: knowledge graph, malware, knowledge extraction
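
The extraction-and-storage step summarized in the abstract can be pictured with a minimal Python sketch. It is not the authors' implementation: it assumes a sequence tagger (such as a BERT-BiLSTM-CRF model) has already produced BIO tags, and the entity types `MALWARE` and `LOCATION`, the node labels `Malware` and `Location`, the relationship `DISCOVERED_IN`, and the function names `decode_bio` and `to_cypher` are all hypothetical choices made for illustration. The sample sentence is invented as well.

```python
from typing import Dict, List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush the one in progress, if any.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the entity in progress.
            current.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close any open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


def to_cypher(entities: List[Tuple[str, str]]) -> List[Tuple[str, Dict[str, str]]]:
    """Pair each malware name with each location as a parameterized Cypher MERGE.

    Linking every malware to every location in a sentence is a simplification;
    the labels and relationship name are assumptions, not the paper's schema.
    """
    malware = [text for text, kind in entities if kind == "MALWARE"]
    places = [text for text, kind in entities if kind == "LOCATION"]
    query = (
        "MERGE (m:Malware {name: $malware}) "
        "MERGE (l:Location {name: $place}) "
        "MERGE (m)-[:DISCOVERED_IN]->(l)"
    )
    return [(query, {"malware": m, "place": p}) for m in malware for p in places]


if __name__ == "__main__":
    # Invented sentence and tags, standing in for real tagger output.
    tokens = ["ExampleWorm", "was", "found", "in", "Example", "City"]
    tags = ["B-MALWARE", "O", "O", "O", "B-LOCATION", "I-LOCATION"]
    for query, params in to_cypher(decode_bio(tokens, tags)):
        print(query, params)
```

Each returned pair is a query string plus a parameter dictionary, which a Neo4j client could execute as a parameterized query; using `MERGE` rather than `CREATE` keeps repeated mentions of the same entity from producing duplicate nodes.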