
Computer Engineering & Science, 2024, Vol. 46, Issue (10): 1815-1824.

• Software Engineering •

Code plagiarism detection based on graph neural network

CHEN Chang-feng1, ZHAO Hong-zhou1, ZHOU Kai-qing2

  1. College of Computer Science and Engineering, Jishou University, Jishou, Hunan 416000, China
  2. School of Communication and Electronic Engineering, Jishou University, Jishou, Hunan 416000, China
  • Received: 2023-05-03 Revised: 2023-12-25 Accepted: 2024-09-25 Online: 2024-10-25 Published: 2024-10-29
  • Funding:
    National Natural Science Foundation of China (62266019); Scientific Research Project of the Hunan Provincial Department of Education (21C0363)

Abstract: As open-source data becomes increasingly accessible, the cost of plagiarizing code has fallen, which seriously harms the healthy development of the software industry. Existing plagiarism detection methods cannot deeply mine the semantic and structural information of source code and therefore perform poorly at detecting semantic plagiarism. To address this problem, a code plagiarism detection method based on graph neural networks is proposed. The method uses a graph neural network to represent the features of source code, including its semantic and structural information, and a graph attention network to strengthen those features. A neural tensor network then produces a similarity vector for each pair of source codes, and a fully connected network computes their similarity from that vector. A dropout mechanism is added to balance neuron weights, optimize the model design, and prevent overfitting. To verify its effectiveness, the method was evaluated on an online judge (OJ) system dataset and compared with currently popular detection methods. The experimental results show that the proposed method achieves better detection performance.

Key words: code plagiarism detection, deep semantic and structural information extraction, graph neural network, graph attention network, feature enhancement
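The pipeline outlined in the abstract (graph representation, attention-based feature strengthening, neural tensor network, fully connected scoring) can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no layer sizes, graph construction rules, or training details, so everything below is an assumption. In particular, the weights are random stand-ins for trained parameters, `attention_pool` is a simplified attention pooling used in place of a full graph attention network, the toy graphs are hand-written rather than parsed from source code, and all function and parameter names are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rng, rows, cols):
    """Small random weights; stand-ins for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def propagate(feats, adj, W):
    """One message-passing round: average a node with its neighbours, apply W, then ReLU."""
    out = []
    for i, h in enumerate(feats):
        group = [h] + [feats[j] for j in adj[i]]
        mean = [sum(col) / len(group) for col in zip(*group)]
        out.append([max(0.0, v) for v in matvec(W, mean)])
    return out

def attention_pool(feats):
    """Attention-weighted sum of node vectors; returns (graph vector, weights)."""
    context = [math.tanh(sum(col) / len(feats)) for col in zip(*feats)]
    scores = [sum(a * b for a, b in zip(h, context)) for h in feats]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(feats[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, feats)) for d in range(dim)]
    return pooled, weights

def ntn(h1, h2, tensors, V, bias):
    """Neural tensor network: slice k gives relu(h1^T T_k h2 + V_k [h1; h2] + b_k)."""
    linear = matvec(V, h1 + h2)  # h1 + h2 concatenates the two lists
    sims = []
    for T_k, lin_k, b_k in zip(tensors, linear, bias):
        bilinear = sum(a * b for a, b in zip(h1, matvec(T_k, h2)))
        sims.append(max(0.0, bilinear + lin_k + b_k))
    return sims

def similarity(graph_a, graph_b, params):
    """Embed both graphs, compare them with the NTN, squash to (0, 1) with a dense layer."""
    vecs = []
    for feats, adj in (graph_a, graph_b):
        hidden = propagate(feats, adj, params["W"])
        vecs.append(attention_pool(hidden)[0])
    sim_vec = ntn(vecs[0], vecs[1], params["T"], params["V"], params["b"])
    logit = sum(w * s for w, s in zip(params["fc"], sim_vec))
    return 1.0 / (1.0 + math.exp(-logit))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM, SLICES = 4, 3      # hypothetical feature size and number of NTN slices
params = {
    "W": rand_matrix(rng, DIM, DIM),
    "T": [rand_matrix(rng, DIM, DIM) for _ in range(SLICES)],
    "V": rand_matrix(rng, SLICES, 2 * DIM),
    "b": [0.0] * SLICES,
    "fc": [rng.uniform(-0.5, 0.5) for _ in range(SLICES)],
}
# Toy graphs as (node feature vectors, adjacency lists); real inputs would come from parsed code.
graph_a = ([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], [[1], [0, 2], [1]])
graph_b = ([[1, 0, 0, 0], [0, 0, 0, 1]], [[1], [0]])
score = similarity(graph_a, graph_b, params)
```

Because the parameters are untrained, `score` only demonstrates the data flow; it says nothing about whether the two toy graphs are plagiarized. Dropout is left out because it is normally active only during training, and this sketch covers the forward pass alone.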