• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining

A document representation model based on Wasserstein GAN

MA Yongjun, LI Yajun, WANG Rui, CHEN Haishan

  1. (College of Computer Science and Information Engineering, Tianjin University of Science & Technology, Tianjin 300457, China)
  • Received: 2018-01-22 Revised: 2018-02-28 Online: 2019-01-25 Published: 2019-01-25
  • Supported by:

    Tianjin Science and Technology Plan Project (17KPXMSF00140); Major Social Science Project of the Tianjin Municipal Education Commission (2017JWZD19)

Abstract:

Document representation models convert unstructured text data into structured data and underlie many natural language processing tasks. Current word-based models, however, cannot handle out-of-vocabulary words and cannot represent a document directly. A generative adversarial network (GAN) trains two neural networks against each other and can thereby learn the characteristics of the original data distribution well. Building on this, we propose the Wasserstein adversarial document model (WADM), which uses a denoising autoencoder as its discriminator network and obtains the distributed representation of a document directly from the autoencoder's hidden layer. Experiments show that WADM extracts document features accurately and has stronger document representation capability than word-based models.

Key words: document representation, generative adversarial network (GAN), denoising autoencoder, neural network
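The idea in the abstract of reading a document vector off the hidden layer of a denoising autoencoder can be illustrated with a minimal, untrained sketch. Everything below is an assumption made for illustration and is not taken from the paper: the bag-of-words input, the tanh hidden layer, masking noise, the use of reconstruction error as the discriminator's score, and all names (`corrupt`, `encode`, `wasserstein_gap`, and so on). The weights are random, so the numbers carry no meaning; only the data flow is shown.

```python
import math
import random


def corrupt(bow, drop_prob, rng):
    """Masking noise (assumed): zero each word count with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else x for x in bow]


def affine(x, weights, bias):
    """Compute weights @ x + bias using plain lists (one row per output unit)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def encode(bow, w_enc, b_enc):
    """Hidden layer of the autoencoder; used here as the document vector."""
    return [math.tanh(z) for z in affine(bow, w_enc, b_enc)]


def decode(hidden, w_dec, b_dec):
    """Linear reconstruction of the bag-of-words vector from the hidden layer."""
    return affine(hidden, w_dec, b_dec)


def reconstruction_error(bow, w_enc, b_enc, w_dec, b_dec, drop_prob, rng):
    """Mean squared error between the clean input and the reconstruction
    computed from a corrupted copy (the denoising objective)."""
    noisy = corrupt(bow, drop_prob, rng)
    recon = decode(encode(noisy, w_enc, b_enc), w_dec, b_dec)
    return sum((r - x) ** 2 for r, x in zip(recon, bow)) / len(bow)


def wasserstein_gap(real_docs, fake_docs, score):
    """WGAN-style critic objective (assumed form): mean score on generated
    documents minus mean score on real documents."""
    real = sum(score(d) for d in real_docs) / len(real_docs)
    fake = sum(score(d) for d in fake_docs) / len(fake_docs)
    return fake - real


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
VOCAB_SIZE, HIDDEN_SIZE = 6, 3  # toy sizes, chosen only for the example
w_enc = random_matrix(HIDDEN_SIZE, VOCAB_SIZE, rng)
b_enc = [0.0] * HIDDEN_SIZE
w_dec = random_matrix(VOCAB_SIZE, HIDDEN_SIZE, rng)
b_dec = [0.0] * VOCAB_SIZE

doc = [2.0, 0.0, 1.0, 0.0, 0.0, 3.0]      # hypothetical word counts
doc_vector = encode(doc, w_enc, b_enc)    # document representation
error = reconstruction_error(doc, w_enc, b_enc, w_dec, b_dec, 0.3, rng)
```

In an adversarial setup of this kind, a generator would produce the `fake_docs` and both networks would be trained in alternation. After training, only `encode` is needed to turn a document into a fixed-length vector.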