• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (06): 1083-1089.

• Graphics and Images •

  • Supported by:
    National Natural Science Foundation of China (61976196); Zhejiang Province "Ten Thousand Talents Plan" Outstanding Talent Project (2018R51001); Zhejiang Provincial Natural Science Foundation (LZ22F030003)

A text-to-image model based on a two-stage stacked generative adversarial network with spectral normalization

WANG Xia,XU Hui-ying,ZHU Xin-zhong   

  1. (College of Mathematics and Computer Science,Zhejiang Normal University,Jinhua 321004,China)
  • Received:2021-06-07 Revised:2021-07-13 Accepted:2022-06-25 Online:2022-06-25 Published:2022-06-17



Abstract: Generating images from text is a challenging task in the machine learning community. Although significant progress has been made, problems such as unstable network training and vanishing gradients remain. To address these shortcomings, this paper builds on the stacked generative adversarial network (StackGAN) and proposes a text-to-image generation model that combines spectral normalization with a perceptual loss function. Firstly, the model applies spectral normalization to the discriminator, constraining the gradient of each layer to a fixed range; this relatively slows the convergence of the discriminator and thereby improves the stability of network training. Secondly, a perceptual loss function is added to the generator network to enhance the consistency between the text semantics and the generated image. The Inception score is used to evaluate the quality of the generated images. Experimental results show that, compared with the original StackGAN, the proposed model is more stable and generates more realistic images.
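As a rough illustration of the spectral-normalization idea described in the abstract, the PyTorch sketch below wraps each discriminator layer in `torch.nn.utils.spectral_norm`, which rescales the weight by its largest singular value so the layer stays approximately 1-Lipschitz and per-layer gradients remain bounded. The layer sizes and depth here are illustrative assumptions, not the paper's exact StackGAN discriminator architecture.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch):
    # spectral_norm divides the weight by its largest singular value,
    # keeping the layer roughly 1-Lipschitz and bounding its gradient
    return spectral_norm(nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1))

class SNDiscriminator(nn.Module):
    """Toy spectrally normalized discriminator (illustrative sizes)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            sn_conv(3, 64), nn.LeakyReLU(0.2),    # 64x64 -> 32x32
            sn_conv(64, 128), nn.LeakyReLU(0.2),  # 32x32 -> 16x16
            sn_conv(128, 256), nn.LeakyReLU(0.2), # 16x16 -> 8x8
        )
        self.head = spectral_norm(nn.Linear(256 * 8 * 8, 1))

    def forward(self, x):
        h = self.net(x).flatten(1)
        return self.head(h)

d = SNDiscriminator()
score = d(torch.randn(2, 3, 64, 64))
print(score.shape)  # torch.Size([2, 1])
```

Because the normalization acts on the weights rather than the activations, it needs no extra loss term; it simply tempers how fast the discriminator can sharpen, which is the stabilizing effect the abstract describes.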

Key words: deep learning, generative adversarial network, text-to-image generation, spectral normalization, perceptual loss function
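The perceptual-loss idea mentioned above can be sketched as an L2 distance between deep features of the generated and reference images. In practice the extractor would be a frozen pretrained network such as VGG; the tiny stand-in extractor below is a placeholder assumption so the sketch runs self-contained, and the exact network and layer used by the paper are not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceptualLoss(nn.Module):
    """MSE between deep features of generated and real images.

    `extractor` would normally be a frozen pretrained backbone
    (e.g. VGG features); any feature-producing module works here.
    """
    def __init__(self, extractor):
        super().__init__()
        self.extractor = extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad = False  # loss network stays fixed

    def forward(self, fake, real):
        # distance in feature space rewards semantic, not pixel, similarity
        return F.mse_loss(self.extractor(fake), self.extractor(real))

# toy extractor standing in for pretrained VGG features (assumption)
extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
loss_fn = PerceptualLoss(extractor)
fake = torch.randn(2, 3, 64, 64)
loss = loss_fn(fake, fake.clone())
print(float(loss))  # identical inputs give a loss of 0.0
```

Added to the generator objective, this term pushes generated images toward the feature statistics of real images that match the text, which is how it enforces the text-image consistency the abstract claims.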