
Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (09): 1680-1689.

• Artificial Intelligence and Data Mining •


Proximal policy optimization and adversarial learning based dialog generation

CAI Yue1, YOU Jin-guo1,2, DING Jia-man1

  (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China;

    2. Computer Technology Application Key Laboratory of Yunnan Province, Kunming 650500, China)

  • Received: 2019-11-26  Revised: 2020-03-10  Accepted: 2020-09-25  Online: 2020-09-25  Published: 2020-09-25
  • Supported by: National Natural Science Foundation of China (61462050, 61562054); Yunnan Provincial Natural Science Foundation (KKSY201603016)


Abstract: Dialogue generation is a key research direction in natural language processing, and generative adversarial nets (GAN) have recently been applied to it with good results. To further improve the quality of dialogue generation, and to address the low training efficiency caused by the poor reuse of the rewards returned by the discriminative model during GAN training, this paper proposes a dialogue generation algorithm, PPO_GAN, based on proximal policy optimization (PPO).
The algorithm generates dialogues with the generative model of the GAN and distinguishes generated dialogues from real dialogues with the discriminative model. The GAN is trained with proximal policy optimization, which handles the non-differentiable backpropagation that arises when the GAN generates dialogues. While guaranteeing monotonically non-decreasing training of the generative model, it allows the rewards obtained from the discriminative model to be reused by constraining the gradient of each generative-model iteration. Experimental results show that, compared with dialogue generation algorithms such as maximum likelihood estimation and Adver-REGS, PPO_GAN improves both the efficiency of dialogue training and the quality of dialogue generation.
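As a rough illustration of the reward-reuse mechanism described above (not taken from the paper; the notation and setup are assumptions based on the standard PPO formulation), the generative model can be updated with PPO's clipped surrogate objective, using the discriminative model's score of a generated reply as the reward signal:

    L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})}

where x is the input utterance, y_t is the t-th generated token, \hat{A}_t is an advantage estimate derived from the discriminative model's reward D(x, y), and \epsilon bounds the size of the policy update. Because the ratio r_t(\theta) is clipped against the sampling policy \pi_{\theta_{\mathrm{old}}}, several generator updates can be performed on the same sampled dialogues and discriminator rewards without the policy drifting too far, which corresponds to the reward reuse and monotonically non-decreasing training discussed in the abstract.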


Key words: dialog generation, proximal policy optimization (PPO), reinforcement learning, generative adversarial nets (GAN), sequence-to-sequence model