A News Keyword Extraction Method Combining LSTM and LDA Differences

Computer Engineering & Science

NING Shan1,2, YAN Xin1,2, ZHOU Feng1,2, WANG Hong-bin1,2, ZHANG Jin-peng3

(1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China;
2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan 650500, China;
3. Center of Information Management, Yunnan University of Finance and Economics, Kunming, Yunnan 650221, China)

Received: 2019-03-08 Revised: 2019-05-21 Online: 2020-01-25 Published: 2020-01-25
Supported by: National Natural Science Foundation of China (61562049, 61462055)

Abstract

To account for the influence of semantic information on TextRank, and given that news headlines are highly condensed and that keywords are characterized by coverage and difference, this paper proposes a news keyword extraction method that combines LSTM and LDA differences. First, the news text is preprocessed to obtain candidate keywords. Second, an LDA topic model yields the topic difference influence degree of each candidate keyword. Third, an LSTM model and a word2vec model are combined to compute the semantic relevance influence degree between each candidate keyword and the headline. Finally, the candidate keyword nodes undergo non-uniform transition according to the topic difference influence degree and the semantic relevance influence degree, which produces the final candidate ranking from which keywords are extracted. The method thus integrates three attributes of keywords: semantic importance, coverage, and difference. Experiments on the Sogou news corpus show that, compared with traditional methods, it clearly improves both accuracy and recall.

Key words: keyword extraction, news headline, TextRank algorithm, word2vec model, LDA model
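The non-uniform transition step can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the co-occurrence window, the linear mix controlled by the hypothetical parameter `alpha`, and the two per-word score dictionaries, which the caller supplies as stand-ins for the LDA-based topic difference influence and the LSTM/word2vec-based headline relevance. It is not the authors' implementation.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=3):
    """Link candidate words that co-occur within a sliding window."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors


def biased_textrank(neighbors, topic_influence, title_relevance,
                    alpha=0.5, damping=0.85, iterations=50):
    """Rank words with a random walk that favours high-influence neighbours.

    topic_influence and title_relevance are hypothetical per-word scores;
    here they are passed in rather than computed by LDA or an LSTM.
    """
    words = sorted(neighbors)
    # Assumed linear mix of the two influence degrees; the tiny constant
    # keeps every transition probability well defined.
    pull = {w: alpha * topic_influence[w]
            + (1 - alpha) * title_relevance[w] + 1e-9
            for w in words}
    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(iterations):
        updated = {w: (1 - damping) / len(words) for w in words}
        for source in words:
            total_pull = sum(pull[n] for n in neighbors[source])
            for target in neighbors[source]:
                # Non-uniform transition: probability proportional to pull.
                share = pull[target] / total_pull
                updated[target] += damping * scores[source] * share
        scores = updated
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative input only: tokens and scores are made up for this sketch.
tokens = ["flood", "rescue", "city", "flood", "rain", "city", "rescue", "team"]
graph = build_cooccurrence_graph(tokens)
topic = {"flood": 0.9, "rescue": 0.7, "city": 0.3, "rain": 0.5, "team": 0.2}
title = {"flood": 0.8, "rescue": 0.9, "city": 0.4, "rain": 0.3, "team": 0.1}
print(biased_textrank(graph, topic, title)[:3])
```

With equal scores for every word the walk reduces to ordinary unweighted TextRank, which makes the effect of the two influence degrees easy to isolate when experimenting.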


NING Shan, YAN Xin, ZHOU Feng, WANG Hong-bin, ZHANG Jin-peng. A news keyword extraction method combining LSTM and LDA differences[J]. Computer Engineering & Science.

