• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• Artificial Intelligence and Data Mining

A short text feature extraction method combining term co-occurrence distance and category information

MA Huifang1,2, XING Yuying1, WANG Shuang1, ZHANG Xupeng1

  (1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China;
    2. Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, Guangxi, China)
  • Received: 2017-01-03 Revised: 2017-05-26 Online: 2018-09-25 Published: 2018-09-25
  • Supported by:

    National Natural Science Foundation of China (61762078, 61363058); Research Project of the Guangxi Key Laboratory of Trusted Software (kx201705); 2016 Gansu Province Undergraduate Innovation and Entrepreneurship Training Program (201610736040, 201610736041)

Abstract:

Traditional feature weighting methods do not fully account for the semantic information between terms or for the distribution of terms across categories. To address this, a short text feature extraction method combining term co-occurrence distance and category information is proposed. On the one hand, the number of terms separating two terms in the same short text is taken as their co-occurrence distance and used to compute the correlation between them; the frequency with which the two terms co-occur is then computed to obtain a correlation weight for each term. On the other hand, an improved expected cross entropy is used to compute the weight of a term in a given category. The two weights are integrated to obtain the weight of every term in each category. The terms in all categories are then sorted by weight in descending order, and the top K terms are selected as the new feature term set. Experiments show that the method effectively improves short text feature extraction.

Key words: short text, co-occurrence distance, expected cross entropy, feature extraction
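
The abstract names the steps of the method but gives none of its formulas, so the following Python sketch only illustrates the overall pipeline under stated assumptions. The correlation of two terms is assumed to be 1/(1 + gap), where gap is the number of terms between them. The per-category summand of the standard expected cross entropy stands in for the paper's improved variant, which is not specified here. The two weights are assumed to be integrated by multiplication. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_weights(docs):
    """Correlation weight per term from co-occurrence distance.

    Assumption: each co-occurrence of two distinct terms in one short text
    contributes 1 / (1 + gap) to both terms, where gap is the number of
    terms between them. Summing over all co-occurrences also reflects how
    often the terms appear together. Weights are scaled to [0, 1].
    """
    weight = defaultdict(float)
    for tokens in docs:
        for i, a in enumerate(tokens):
            for j in range(i + 1, len(tokens)):
                b = tokens[j]
                if a == b:
                    continue
                score = 1.0 / (1.0 + (j - i - 1))
                weight[a] += score
                weight[b] += score
    top = max(weight.values(), default=0.0)
    if top == 0.0:
        return dict(weight)
    return {t: w / top for t, w in weight.items()}


def category_weights(docs, labels):
    """Weight of term t in category c.

    Assumption: uses P(t) * P(c|t) * log(P(c|t) / P(c)), the per-category
    summand of the standard expected cross entropy, estimated from document
    frequencies. This is NOT the paper's improved formula.
    """
    n = len(docs)
    class_count = Counter(labels)
    df = Counter()      # documents containing t
    df_c = Counter()    # documents of category c containing t
    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            df_c[(t, c)] += 1
    weights = {}
    for (t, c), k in df_c.items():
        p_t = df[t] / n
        p_c_given_t = k / df[t]
        p_c = class_count[c] / n
        weights[(t, c)] = p_t * p_c_given_t * math.log(p_c_given_t / p_c)
    return weights


def select_features(docs, labels, k):
    """Rank (term, category) pairs by combined weight, keep top-k terms."""
    co = cooccurrence_weights(docs)
    cat = category_weights(docs, labels)
    # Assumption: the two weights are integrated by multiplication.
    scored = sorted(
        ((co.get(t, 0.0) * w, t) for (t, _c), w in cat.items()),
        key=lambda item: (-item[0], item[1]),
    )
    selected = []
    for _score, term in scored:
        if term not in selected:
            selected.append(term)
        if len(selected) == k:
            break
    return selected


if __name__ == "__main__":
    docs = [
        ["cheap", "flight", "deal"],
        ["flight", "ticket", "deal"],
        ["goal", "match", "win"],
        ["match", "ticket", "win"],
    ]
    labels = ["travel", "travel", "sport", "sport"]
    print(select_features(docs, labels, k=4))
```

In this toy corpus the term "ticket" co-occurs often but is spread evenly over both categories, so its category weight is zero and it is not selected. This is the kind of case that motivates combining the two signals.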