• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Artificial Intelligence and Data Mining

A news keyword extraction method combining LSTM and LDA differences

NING Shan1,2, YAN Xin1,2, ZHOU Feng1,2, WANG Hong-bin1,2, ZHANG Jin-peng3

  (1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504;
    2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500;
    3. Center of Information Management, Yunnan University of Finance and Economics, Kunming 650221, China)

  • Received: 2019-03-08 Revised: 2019-05-21 Online: 2020-01-25 Published: 2020-01-25
  • Funding:

    National Natural Science Foundation of China (61562049, 61462055)

Abstract:

To address the influence of semantic information on TextRank, and considering that news headlines are highly condensed and that keywords should exhibit both coverage and difference, a new keyword extraction method combining LSTM and LDA differences is proposed. First, the news text is preprocessed to obtain candidate keywords. Second, the topic difference influence degree of each candidate keyword is obtained through the LDA topic model. Then, the LSTM model and the word2vec model are combined to calculate the semantic relevance influence degree between the candidate keywords and the headline. Finally, the candidate keyword nodes are transferred non-uniformly according to the topic difference influence degree and the semantic relevance influence degree, which yields the final ranking of candidate keywords from which the keywords are extracted. The proposed method thus combines three attributes of keywords: semantic importance, coverage, and difference. Experimental results on the Sogou whole-web news corpus show that, compared with traditional methods, the proposed method clearly improves both precision and recall.

Key words: keyword extraction, news headline, TextRank algorithm, word2vec model, LDA model
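The final step of the abstract, a non-uniform transfer between candidate keyword nodes, can be illustrated with a small sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the co-occurrence window, the linear mix controlled by `alpha`, the damping factor, and the function names `build_graph` and `biased_textrank`. The two influence scores are taken as precomputed inputs; in the described method they would come from the LDA model and from the LSTM and word2vec models, which are not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(tokens, window=3):
    """Undirected co-occurrence graph: distinct words inside the same
    sliding window of `window` tokens are linked (window size is assumed)."""
    neighbors = defaultdict(set)
    for start in range(len(tokens)):
        span = set(tokens[start:start + window])
        for a, b in combinations(span, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def biased_textrank(neighbors, topic_infl, title_infl,
                    alpha=0.5, damping=0.85, iterations=50):
    """TextRank with non-uniform transitions.

    topic_infl / title_infl: hypothetical precomputed, positive scores per
    word (stand-ins for the topic difference and headline relevance
    influence degrees). A node j passes its score to neighbour i in
    proportion to i's combined influence, instead of splitting it evenly.
    """
    # Assumed combination rule: a simple linear mix of the two influences.
    infl = {w: alpha * topic_infl[w] + (1 - alpha) * title_infl[w]
            for w in neighbors}
    # Total influence among each node's neighbours, used to normalise
    # that node's outgoing transition probabilities.
    out_total = {j: sum(infl[k] for k in neighbors[j]) for j in neighbors}

    n = len(neighbors)
    score = {w: 1.0 / n for w in neighbors}
    for _ in range(iterations):
        updated = {}
        for i in neighbors:
            incoming = sum(score[j] * infl[i] / out_total[j]
                           for j in neighbors[i] if out_total[j] > 0)
            updated[i] = (1 - damping) / n + damping * incoming
        score = updated
    return score


if __name__ == "__main__":
    words = ["economy", "growth", "policy", "economy", "bank", "policy"]
    graph = build_graph(words, window=2)
    # Made-up influence values, for demonstration only.
    topic = {"economy": 0.6, "growth": 0.2, "policy": 0.5, "bank": 0.3}
    title = {"economy": 0.8, "growth": 0.1, "policy": 0.4, "bank": 0.2}
    ranking = sorted(biased_textrank(graph, topic, title).items(),
                     key=lambda kv: kv[1], reverse=True)
    print(ranking)
```

With equal influence values for every word, the transitions become uniform and the sketch reduces to plain unweighted TextRank, which makes the effect of the two influence scores easy to isolate.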

CLC number: