Code plagiarism detection based on graph neural network

Computer Engineering & Science, 2024, Vol. 46, Issue (10): 1815-1824.

CHEN Chang-feng1, ZHAO Hong-zhou1, ZHOU Kai-qing2

(1. College of Computer Science and Engineering, Jishou University, Jishou 416000;
2. School of Communication and Electronic Engineering, Jishou University, Jishou 416000, China)

Received: 2023-05-03 Revised: 2023-12-25 Accepted: 2024-09-25 Online: 2024-10-25 Published: 2024-10-29
Supported by:
National Natural Science Foundation of China (62266019); Scientific Research Project of the Education Department of Hunan Province (21C0363)

Abstract: As open-source data becomes increasingly accessible, the cost of code plagiarism has fallen, which seriously harms the healthy development of the software industry. Existing plagiarism detection methods cannot deeply mine the semantic and structural information of source code, so they perform poorly at detecting semantic plagiarism. To address this problem, this paper proposes a code plagiarism detection method based on graph neural networks. The method uses a graph neural network to represent the features of source code, including its semantic and structural information, and uses a graph attention network to strengthen these features. A neural tensor network then produces a similarity vector for each pair of source codes, and a fully connected network computes the similarity between them from that vector. A dropout mechanism is added to balance neuron weights, optimize the model design, and prevent overfitting. To validate the method, experiments were conducted on a dataset from an OJ system, and the method was compared with currently popular detection methods. The experimental results show that the proposed method achieves better detection performance.

Key words: code plagiarism detection, deep semantic and structural information extraction, graph neural network, graph attention network, feature enhancement
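
The pipeline described in the abstract (graph neural network, attention-based feature strengthening, neural tensor network, fully connected scoring, dropout) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract gives no layer sizes, aggregation rule, or training details, so the mean-neighbour aggregation, the attention readout, the dimensions `dim` and `k`, the random untrained weights, and all function names here are assumptions made for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def rand_matrix(rows, cols, rng):
    # Small random weights stand in for trained parameters (assumption).
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def gnn_layer(adj, feats, weight):
    """One message-passing step: average each node with its neighbours, project, ReLU."""
    out = []
    for i, row in enumerate(adj):
        nbrs = [j for j, a in enumerate(row) if a] + [i]  # neighbours plus self
        mean = [sum(feats[j][d] for j in nbrs) / len(nbrs)
                for d in range(len(feats[0]))]
        out.append([max(0.0, x) for x in matvec(weight, mean)])
    return out

def attention_readout(nodes, att_w):
    """Attention-weighted sum of node embeddings into a single graph vector."""
    dim = len(nodes[0])
    avg = [sum(n[d] for n in nodes) / len(nodes) for d in range(dim)]
    context = [math.tanh(c) for c in matvec(att_w, avg)]
    scores = [1 / (1 + math.exp(-dot(n, context))) for n in nodes]
    return [sum(s * n[d] for s, n in zip(scores, nodes)) for d in range(dim)]

def dropout(vec, p, rng):
    # Inverted dropout: zero each unit with probability p, rescale the survivors.
    return [0.0 if rng.random() < p else x / (1 - p) for x in vec]

def ntn(h1, h2, tensor, v, bias):
    """Neural tensor network: k bilinear slices plus a linear term, then ReLU."""
    concat = h1 + h2
    return [max(0.0, dot(h1, matvec(w_k, h2)) + dot(v_k, concat) + b_k)
            for w_k, v_k, b_k in zip(tensor, v, bias)]

def score_pair(g1, g2, params, train=False, rng=None):
    """Similarity in (0, 1) for two graphs given as (adjacency, node features)."""
    h = []
    for adj, feats in (g1, g2):
        nodes = gnn_layer(adj, feats, params["gnn"])
        h.append(attention_readout(nodes, params["att"]))
    sim_vec = ntn(h[0], h[1], params["tensor"], params["v"], params["bias"])
    if train:  # dropout is applied only during training
        sim_vec = dropout(sim_vec, 0.5, rng)
    z = dot(params["fc_w"], sim_vec) + params["fc_b"]
    return 1 / (1 + math.exp(-z))

rng = random.Random(0)
dim, k = 4, 3  # hypothetical embedding size and number of tensor slices
params = {
    "gnn": rand_matrix(dim, dim, rng),
    "att": rand_matrix(dim, dim, rng),
    "tensor": [rand_matrix(dim, dim, rng) for _ in range(k)],
    "v": rand_matrix(k, 2 * dim, rng),
    "bias": [0.0] * k,
    "fc_w": [rng.uniform(-0.5, 0.5) for _ in range(k)],
    "fc_b": 0.0,
}

# Toy graphs standing in for graphs built from two source files.
graph_a = ([[0, 1, 0], [1, 0, 1], [0, 1, 0]],
           [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])
graph_b = ([[0, 1], [1, 0]],
           [[1, 0, 0, 0], [0, 0, 0, 1]])
print(score_pair(graph_a, graph_b, params))
```

With untrained weights the printed score carries no meaning; the sketch only shows how the stages connect. In practice the graphs would be derived from the source code itself, and the parameters would be learned from labelled pairs of programs.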

CHEN Chang-feng, ZHAO Hong-zhou, ZHOU Kai-qing. Code plagiarism detection based on graph neural network[J]. Computer Engineering & Science, 2024, 46(10): 1815-1824.
