• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (11): 2060-2069.

• Artificial Intelligence and Data Mining •

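  • Supported by:
    Ministry of Education Collaborative Education Project (201902137008); Hebei Province Higher Education Teaching Reform Research and Practice Project (2020GJJG158)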

A TextRank automatic summarization generation algorithm based on co-occurrence keywords

YAN Hong-can1,2, LI Bo-chu1, GU Jian-tao1,2

  (1. College of Science, North China University of Science and Technology, Tangshan 063210;
    2. Key Laboratory of Data Science and Application of Hebei Province, Tangshan 063000, China)
  • Received: 2021-11-09 Revised: 2022-07-29 Accepted: 2023-11-25 Online: 2023-11-25 Published: 2023-11-16

Abstract: When generating summaries, the traditional TextRank algorithm considers only the similarity between sentences and neglects the similarity among the articles themselves, and the resulting summaries often repeat the same information. To address this, a TextRank algorithm based on co-occurrence keywords is proposed. The article is represented as sentence vectors using the word2vec model. Taking the category of the article into account, the co-occurrence keywords of articles in that category are used as parameters in the iterative calculation of sentence weights. The weights obtained by iteration are then corrected using sentence length, the number of keywords, and other information. Experimental results show that the proposed algorithm improves the comprehensiveness and accuracy of the generated summaries. The algorithm also applies maximal marginal relevance (MMR) to remove redundancy from the extracted summary, which reduces repeated expressions.
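The abstract does not give the weighting formula, so the following is only a minimal sketch of one way co-occurrence keywords could enter the TextRank iteration. It assumes each sentence already has a vector (for example, an average of its word2vec word embeddings) and uses a personalized-PageRank-style bias in which a sentence's teleport probability grows with the number of category keywords it contains. The function names, the `1 + count` bias, and the damping value are illustrative assumptions, not the authors' method; the length-based correction of the weights is omitted.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def keyword_biased_textrank(vectors, keyword_counts, damping=0.85,
                            tol=1e-8, max_iter=200):
    """Score sentences with a TextRank iteration biased by keyword counts.

    vectors        : (n, d) array, one vector per sentence.
    keyword_counts : length-n sequence, number of co-occurrence keywords
                     found in each sentence (assumed to be precomputed).
    """
    vectors = np.asarray(vectors, dtype=float)
    n = len(vectors)

    # Sentence graph: non-negative cosine similarity, no self-loops.
    sim = np.clip(cosine_similarity_matrix(vectors), 0.0, None)
    np.fill_diagonal(sim, 0.0)

    # Column-normalize so each sentence spreads its score over its neighbors;
    # a sentence with no neighbors spreads its score uniformly.
    col_sums = sim.sum(axis=0, keepdims=True)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    transition = np.where(col_sums > 0, sim / safe_sums, 1.0 / n)

    # Assumed bias: teleport probability proportional to 1 + keyword count.
    bias = 1.0 + np.asarray(keyword_counts, dtype=float)
    bias /= bias.sum()

    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1.0 - damping) * bias + damping * (transition @ scores)
        if np.abs(updated - scores).sum() < tol:
            scores = updated
            break
        scores = updated
    return scores
```

With `keyword_counts` set to all zeros the bias becomes uniform and the iteration reduces to ordinary similarity-based TextRank, which makes the effect of the keyword term easy to isolate.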

Key words: automatic summary generation, TextRank, co-occurrence keyword, MMR algorithm, word2vec model
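The redundancy-removal step named in the abstract can be illustrated with the standard greedy form of MMR. This is a sketch under assumptions: sentence relevance is taken to be a precomputed score per sentence (such as a TextRank weight), `similarity` is a precomputed sentence-to-sentence similarity matrix, and the trade-off parameter `lam` is an illustrative value, not one reported by the authors.

```python
import numpy as np


def mmr_select(scores, similarity, k, lam=0.7):
    """Greedily pick up to k sentence indices with maximal marginal relevance.

    scores     : length-n relevance scores, one per sentence.
    similarity : (n, n) matrix of pairwise sentence similarities.
    lam        : trade-off between relevance (lam) and redundancy (1 - lam).
    Returns the indices in the order they were selected.
    """
    scores = np.asarray(scores, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    selected = []
    candidates = list(range(len(scores)))

    while candidates and len(selected) < k:
        best, best_value = None, -np.inf
        for i in candidates:
            # Redundancy is the highest similarity to any sentence already chosen.
            redundancy = max((similarity[i, j] for j in selected), default=0.0)
            value = lam * scores[i] - (1.0 - lam) * redundancy
            if value > best_value:
                best, best_value = i, value
        selected.append(best)
        candidates.remove(best)
    return selected
```

Sorting the returned indices restores document order before the selected sentences are joined into a summary.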