• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (03): 470-477.

• Computer Network and Information Security •

A real-time facial manipulation video detection model based on ensemble learning dual-stream neural network

YUAN Ye1,2,3,HUANG Li-qing1,2,3,YE Feng1,2,3,HUANG Tian-qiang1,2,3,LUO Hai-feng1,2,3,XU Chao1,2,3   

  1. College of Computer and Cyber Security, Fujian Normal University, Fuzhou 350117, China
  2. Digital Fujian Institute of Big Data Security Technology, Fuzhou 350117, China
  3. Fujian Provincial Engineering Research Center of Big Data Analysis and Application, Fuzhou 350117, China
  • Received:2022-10-27 Revised:2022-12-25 Accepted:2023-03-25 Online:2023-03-25 Published:2023-03-22

Abstract: Malicious face manipulation threatens social security and stability, so accurately detecting face-tampered videos and images is an important problem. To address the poor real-time performance of existing video manipulation detection models, this paper proposes a face manipulation video detection model based on an ensemble-learning dual-stream recurrent neural network that incorporates the voting mechanism of ensemble learning. The model first takes a small number of consecutive frames as input, extracts spatial features with a convolutional neural network, and uses central difference convolution to enhance tampering artifacts in the spatial domain. It then computes differences between consecutive frames to enhance tampering artifacts in the temporal domain and extracts temporal features with a convolutional neural network. Next, the model concatenates the spatial and temporal feature vectors of the two streams and passes them through a recurrent neural network. During this step, the frame-by-frame features are retained as inputs to the auxiliary frame-level discriminators, while the final output of the recurrent neural network is used as the input to the video-level discriminator. Finally, the model applies the ensemble voting mechanism to combine the outputs of the multiple auxiliary frame-level discriminators and the video-level discriminator, and introduces a weight hyperparameter γ to balance the importance of the two kinds of discriminator, which helps improve detection accuracy. Experimental results on the FaceForensics++ dataset show that the proposed model improves average accuracy by 0.4% and 1.0% compared with mainstream detection models. In addition, the proposed model needs only a small number of consecutive frames for manipulation detection, which improves its real-time performance.
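
The three operations named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names, the mixing factor `theta`, the single-channel layout, and the soft-voting rule are all assumptions made for illustration. In particular, central difference convolution is written in its commonly used form (a `theta`-weighted mix of a convolution over differences to the patch centre and a vanilla convolution), and γ is assumed to weight the averaged frame-level outputs against the video-level output.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def central_difference_conv(image: Matrix, kernel: Matrix, theta: float = 0.7) -> List[List[float]]:
    """Single-channel, stride-1, unpadded central difference convolution.

    Assumed form: theta * sum(w * (x - x_center)) + (1 - theta) * sum(w * x).
    theta = 0 reduces to an ordinary convolution (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    ch, cw = kh // 2, kw // 2  # offset of the patch centre
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            center = image[i + ch][j + cw]
            vanilla = 0.0
            diff = 0.0
            for a in range(kh):
                for b in range(kw):
                    x = image[i + a][j + b]
                    vanilla += kernel[a][b] * x
                    diff += kernel[a][b] * (x - center)
            row.append(theta * diff + (1.0 - theta) * vanilla)
        out.append(row)
    return out


def frame_differences(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract each frame from its successor; T frames give T - 1 differences.

    Each frame is a flat list of pixel values (an assumption for brevity).
    """
    return [
        [cur - prev for cur, prev in zip(frames[t], frames[t - 1])]
        for t in range(1, len(frames))
    ]


def fuse_scores(frame_probs: Sequence[float], video_prob: float, gamma: float) -> float:
    """Blend frame-level and video-level fake probabilities by soft voting.

    Assumed rule: gamma weights the mean of the auxiliary frame-level
    outputs and (1 - gamma) weights the video-level output.
    """
    if not frame_probs:
        raise ValueError("need at least one frame-level probability")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    frame_vote = sum(frame_probs) / len(frame_probs)
    return gamma * frame_vote + (1.0 - gamma) * video_prob
```

Under these assumptions, `gamma = 0` trusts only the video-level discriminator and `gamma = 1` trusts only the averaged frame-level discriminators; a real model would apply the convolution per channel inside a trained network and tune γ on validation data.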

Key words: Deepfake, convolutional neural network, recurrent neural network, voting mechanism, central difference convolution