
Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (03): 470-477.

• Computer Network and Information Security •

A real-time facial manipulation video detection model based on ensemble learning dual-stream neural network

YUAN Ye1,2,3, HUANG Li-qing1,2,3, YE Feng1,2,3, HUANG Tian-qiang1,2,3, LUO Hai-feng1,2,3, XU Chao1,2,3

  1. College of Computer and Cyber Security, Fujian Normal University, Fuzhou 350117, China
  2. Digital Fujian Institute of Big Data Security Technology, Fuzhou 350117, China
  3. Fujian Provincial Engineering Research Center of Big Data Analysis and Application, Fuzhou 350117, China
  • Received: 2022-10-27 Revised: 2022-12-25 Accepted: 2023-03-25 Online: 2023-03-25 Published: 2023-03-22
  • Funding:
    National Natural Science Foundation of China (62072106); Natural Science Foundation of Fujian Province (2020J01168, 2022J01190, 2022J01188); Industry-University Cooperation Project for Universities of Fujian Province (2021H6004); Education and Scientific Research Project for Young and Middle-aged Teachers of Fujian Province (JAT210053, JAT210051)

Abstract: Malicious face manipulation threatens social security and stability, so accurately detecting manipulated face videos is an important problem. To address the poor real-time performance of existing video manipulation detection models, this paper proposes a face manipulation video detection model based on an ensemble learning dual-stream recurrent neural network, and introduces the voting mechanism from ensemble learning. First, the model receives a small number of consecutive frames and extracts spatial features with a convolutional neural network, using central difference convolution to enhance manipulation artifacts in the spatial domain. Second, it computes the differences between consecutive frames to enhance manipulation artifacts in the temporal domain, and extracts temporal features with a convolutional neural network. The spatial and temporal feature vectors of the two streams are then concatenated and passed to a recurrent neural network for further feature extraction. During this step, the per-frame features are retained as inputs to the auxiliary frame-level discriminators, while the final output of the recurrent neural network serves as the input to the video-level discriminator. Finally, the voting mechanism of the ensemble model combines the outputs of the multiple auxiliary frame-level discriminators and the video-level discriminator, and a weight hyperparameter γ balances the importance of the two kinds of discriminator, which helps improve detection accuracy. On the FaceForensics++ dataset, the proposed model improves average accuracy by 0.4% and 1.0% compared with mainstream detection models. In addition, the proposed model can detect manipulation using only a few consecutive frames, which improves its real-time performance.
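The two artifact-enhancement steps named in the abstract, frame differencing for the temporal stream and central difference convolution for the spatial stream, can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names, the single-channel list-of-lists frame layout, the absence of padding, and the default `theta = 0.7` are all assumptions made for illustration. The central difference formula used is the commonly cited one, in which a `theta`-weighted term is subtracted from an ordinary convolution; the paper may use a different variant.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel values (assumed layout)


def frame_differences(frames: List[Frame]) -> List[Frame]:
    """Subtract each frame from its successor to expose temporal changes."""
    return [
        [[b - a for a, b in zip(row_prev, row_next)]
         for row_prev, row_next in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def central_difference_conv(frame: Frame, kernel: Frame, theta: float = 0.7) -> Frame:
    """Central difference convolution with an odd-sized square kernel, no padding.

    y(p0) = sum_n w(pn) * x(p0 + pn) - theta * x(p0) * sum_n w(pn)
    theta = 0 reduces to an ordinary convolution (cross-correlation);
    theta = 1 responds only to differences from the centre pixel.
    """
    k = len(kernel)
    r = k // 2
    w_sum = sum(sum(row) for row in kernel)
    height, width = len(frame), len(frame[0])
    out = []
    for i in range(r, height - r):
        out_row = []
        for j in range(r, width - r):
            vanilla = sum(
                kernel[di][dj] * frame[i - r + di][j - r + dj]
                for di in range(k) for dj in range(k)
            )
            out_row.append(vanilla - theta * frame[i][j] * w_sum)
        out.append(out_row)
    return out


# A perfectly flat frame has no local gradients, so with theta = 1 the response is zero.
flat = [[5.0] * 4 for _ in range(4)]
ones = [[1.0] * 3 for _ in range(3)]
print(central_difference_conv(flat, ones, theta=1.0))  # [[0.0, 0.0], [0.0, 0.0]]
```

In a trained network the kernel weights would be learned and the operation applied per channel; the sketch only shows why the central difference term suppresses flat regions and emphasizes local intensity changes, which is where manipulation artifacts tend to appear.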

Key words: Deepfake, convolutional neural network, recurrent neural network, voting mechanism, central difference convolution
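The voting mechanism can be sketched in the same spirit. The abstract states only that a weight hyperparameter γ balances the auxiliary frame-level discriminators against the video-level discriminator; it does not give the fusion rule. The sketch below therefore assumes one plausible form, a convex combination of the mean frame-level probability and the video-level probability. The function name `ensemble_vote`, the averaging of frame scores, and the 0.5 decision threshold are hypothetical.

```python
from typing import Sequence, Tuple


def ensemble_vote(frame_probs: Sequence[float], video_prob: float,
                  gamma: float = 0.5, threshold: float = 0.5) -> Tuple[float, bool]:
    """Fuse frame-level and video-level fake probabilities (assumed rule).

    fused = gamma * mean(frame_probs) + (1 - gamma) * video_prob
    Returns the fused score and whether it reaches the decision threshold.
    """
    if not frame_probs:
        raise ValueError("at least one frame-level probability is required")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    frame_score = sum(frame_probs) / len(frame_probs)
    fused = gamma * frame_score + (1.0 - gamma) * video_prob
    return fused, fused >= threshold


# Two frame-level votes averaging 0.5 and a confident video-level vote of 1.0.
print(ensemble_vote([0.25, 0.75], 1.0, gamma=0.5))  # (0.75, True)
```

Under this assumed rule, `gamma = 0` relies on the video-level discriminator alone and `gamma = 1` relies only on the frame-level votes, so γ acts as a single knob for trading off the two sources of evidence.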