• Journal of the China Computer Federation
  • Core Journal of Chinese Science and Technology
  • Chinese Core Journal

计算机工程与科学 (Computer Engineering & Science)

• Artificial Intelligence and Data Mining

Short text similarity measure based on co-occurrence distance and discrimination

LIU Wen (刘文)1, MA Huifang (马慧芳)1,2, TUO Ting (脱婷)1, CHEN Haibo (陈海波)1

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu 730070, China
  2. Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
  • Received: 2016-12-20 Revised: 2017-02-28 Online: 2018-07-25 Published: 2018-07-25
  • Funding:

    National Natural Science Foundation of China (61762078, 61363058); Research Project of the Guangxi Key Laboratory of Trusted Software (KX201705);
    Northwest Normal University Student Innovation Ability Program (CX2018Y054)

Abstract:

Short texts are brief and their features are sparse. To address these characteristics, we propose a short text similarity measure that combines co-occurrence distance and discrimination. On the one hand, the method uses the distance between two co-occurring terms across the entire short text corpus to compute their co-occurrence distance correlation. On the other hand, it computes co-occurrence discrimination to improve the accuracy of that correlation, and then assigns a relevance weight to each term in a text. Finally, the similarity between two texts is calculated from the term weights and the co-occurrence distance correlation between terms. Experimental results show that the proposed method improves the accuracy of short text similarity calculation.

Key words: short text, co-occurrence distance correlation, co-occurrence discrimination, term weighting, similarity calculation
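
The abstract describes the pipeline only in words and gives no formulas, so the Python sketch below is a hypothetical instantiation of its three steps, not the authors' definitions. Everything specific is an assumption made for illustration: distance correlation is taken as the mean inverse closest distance between two terms over the texts containing both, discrimination as the Jaccard overlap of the two terms' text sets, the term weight as a term's mean relatedness to the terms of its own text, and the final score as a weighted sum normalized by the texts' self-similarities. The function names `build_relatedness`, `term_weights`, and `similarity` are likewise invented here.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_relatedness(corpus):
    """Estimate term relatedness from the whole corpus (a list of token lists)."""
    doc_freq = defaultdict(int)    # term -> number of texts containing it
    co_freq = defaultdict(int)     # (a, b) -> number of texts containing both
    inv_dist = defaultdict(float)  # (a, b) -> sum of 1 / closest distance

    for tokens in corpus:
        positions = defaultdict(list)
        for i, tok in enumerate(tokens):
            positions[tok].append(i)
        for term in positions:
            doc_freq[term] += 1
        for a, b in combinations(sorted(positions), 2):
            closest = min(abs(i - j) for i in positions[a] for j in positions[b])
            co_freq[(a, b)] += 1
            inv_dist[(a, b)] += 1.0 / closest

    def rel(a, b):
        if a == b:
            return 1.0
        pair = (a, b) if a < b else (b, a)
        co = co_freq.get(pair, 0)
        if co == 0:
            return 0.0
        # Assumed distance correlation: mean inverse closest distance.
        distance_corr = inv_dist[pair] / co
        # Assumed discrimination: Jaccard overlap of the two terms' text sets.
        discrimination = co / (doc_freq[a] + doc_freq[b] - co)
        return distance_corr * discrimination

    return rel


def term_weights(tokens, rel):
    """Assumed weighting: mean relatedness of a term to the terms of its text."""
    terms = sorted(set(tokens))
    # rel(a, a) == 1.0 is included, so every weight is strictly positive.
    return {a: sum(rel(a, b) for b in terms) / len(terms) for a in terms}


def similarity(tokens1, tokens2, rel):
    """Weighted cross-relatedness, normalized by each text's self-similarity."""
    if not tokens1 or not tokens2:
        return 0.0
    w1, w2 = term_weights(tokens1, rel), term_weights(tokens2, rel)

    def cross(wa, wb):
        return sum(wa[a] * wb[b] * rel(a, b) for a in wa for b in wb)

    return cross(w1, w2) / sqrt(cross(w1, w1) * cross(w2, w2))
```

For example, calling `rel = build_relatedness(corpus)` once on a tokenized corpus and then `similarity(text_a, text_b, rel)` on any two token lists gives a score that is positive even when the two texts share no literal term, provided their terms co-occur elsewhere in the corpus. That is the property the abstract relies on to cope with sparse short texts.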