• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (12): 2186-2196.

• Graphics and Images • Previous Articles     Next Articles

Image adversarial cascade generation via coupling word and sentence-level text features

BAI Zhi-yuan1,2,YANG Zhi-xiang1,2,LUAN Hong-kang1,2,SUN Yu-bao1,2   

  1. (1. School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044;
    2. Jiangsu Key Laboratory of Big Data Analysis Technology,
    School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 210044, China)
  • Received: 2023-03-24 Revised: 2023-06-07 Accepted: 2023-12-25 Online: 2023-12-25 Published: 2023-12-14

Abstract: Text-to-image generation aims to synthesize realistic images from natural language descriptions and is a cross-modal task spanning text and images. Because generative adversarial networks (GANs) produce realistic images efficiently, they have become the mainstream model for text-to-image generation. However, current methods often train on word-level and sentence-level text features separately, so the text information is not fully exploited, which easily leads to mismatches between the generated image and the text. To address this problem, this paper proposes an image adversarial cascade generation model (Union-GAN) that couples word-level and sentence-level text features, introducing a text-image joint perception module (Union-Block) at each image generation stage. By combining channel-wise affine transformation with cross-modal attention, the module fully exploits both the word-level semantics and the overall semantics of the text to generate images that match the text description while maintaining clear structure. Meanwhile, jointly optimizing the discriminators and adding spatial attention to the corresponding discriminator allows the supervisory signal from the text to push the generator toward more text-relevant images. Experimental results on the CUB-200-2011 dataset show that, compared with several representative networks such as AttnGAN, Union-GAN achieves an FID score of 13.67, a 42.9% improvement over AttnGAN, and an IS score of 4.52, an increase of 0.16.
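To make the two conditioning paths concrete, the following is a minimal NumPy sketch of the kind of computation a Union-Block-style step performs: a sentence embedding drives a per-channel affine transform (FiLM-style) of the image feature map, and word embeddings condition each spatial location through single-head cross-modal attention. This is not the authors' implementation; all shapes, projection matrices, and the single-head attention form are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): sentence-level channel affine
# conditioning followed by word-level cross-modal attention on a feature map.
import numpy as np

rng = np.random.default_rng(0)

def channel_affine(feat, sent, Wg, Wb):
    """Sentence-level path: per-channel scale and shift predicted from the
    sentence embedding, applied across all spatial positions."""
    gamma = sent @ Wg                              # (C,)
    beta = sent @ Wb                               # (C,)
    return (1.0 + gamma)[:, None, None] * feat + beta[:, None, None]

def cross_modal_attention(feat, words, Wq, Wk):
    """Word-level path: each spatial location attends over the word
    features and adds the attended word context back (residual)."""
    C, H, W = feat.shape
    q = feat.reshape(C, H * W).T @ Wq              # queries from pixels, (HW, d)
    k = words @ Wk                                 # keys from words, (T, d)
    logits = q @ k.T / np.sqrt(q.shape[1])         # (HW, T)
    logits -= logits.max(axis=1, keepdims=True)    # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)        # attention over words
    ctx = attn @ words                             # per-pixel word context, (HW, C)
    return feat + ctx.T.reshape(C, H, W)

# Toy dimensions: C channels, 4x4 spatial grid, T words, 16-dim sentence vector.
C, H, W, T = 8, 4, 4, 5
feat = rng.standard_normal((C, H, W))
sent = rng.standard_normal(16)                     # sentence embedding
words = rng.standard_normal((T, C))                # word embeddings
Wg = rng.standard_normal((16, C)) * 0.1
Wb = rng.standard_normal((16, C)) * 0.1
Wq = rng.standard_normal((C, C)) * 0.1
Wk = rng.standard_normal((C, C)) * 0.1

out = cross_modal_attention(channel_affine(feat, sent, Wg, Wb), words, Wq, Wk)
print(out.shape)  # (8, 4, 4) -- conditioned feature map keeps its shape
```

The residual form keeps the feature map's shape, so such a block can be stacked at every stage of a cascade generator, as the abstract describes.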

Key words: text-to-image generation, generative adversarial network (GAN), multimodal task