• Official journal of the China Computer Federation
  • Chinese science and technology core journal
  • Chinese core journal

J4 ›› 2008, Vol. 30 ›› Issue (11): 86-91.

• Papers •

Research on Information Retrieval Based on Dirichlet Distribution Language Modeling

Wen Jian (文健)[1], Li Zhoujun (李舟军)[2]

  • Online: 2008-11-01 Published: 2010-05-19

Abstract:

The unigram language model based on the multinomial distribution cannot represent word burstiness in documents, whereas language models based on the Dirichlet distribution handle it well. This paper analyzes and discusses several language models based on the Dirichlet distribution. Taking the DCM model as the basis, it builds language models for documents and for queries separately, and then measures the similarity between the document model and the query model by KL-divergence. Experimental results on the TREC data sets show that, compared with the basic model, the DCM model improves the average precision of information retrieval.

Key words: burstiness; Dirichlet distribution; DCM; information retrieval
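
The ideas named in the abstract can be illustrated with a minimal sketch. It assumes that DCM refers to the Dirichlet compound multinomial distribution, and that a document model can be taken as the Dirichlet posterior mean of the word probabilities; the abstract does not state how the paper estimates its models, so this is not the authors' implementation. All function names, the toy vocabulary, and the parameter values are hypothetical.

```python
import math

def multinomial_log_prob(counts, probs):
    """Log-probability of a word-count vector under a multinomial model."""
    n = sum(counts)
    log_p = math.lgamma(n + 1)
    for x, p in zip(counts, probs):
        log_p -= math.lgamma(x + 1)
        if x > 0:
            log_p += x * math.log(p)
    return log_p

def dcm_log_prob(counts, alphas):
    """Log-probability of a word-count vector under a Dirichlet compound
    multinomial (DCM) with parameters `alphas` (assumed reading of "DCM")."""
    n = sum(counts)
    a = sum(alphas)
    log_p = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(a + n)
    for x, al in zip(counts, alphas):
        log_p += math.lgamma(x + al) - math.lgamma(al) - math.lgamma(x + 1)
    return log_p

def smoothed_model(counts, alphas):
    """Dirichlet posterior mean of the word probabilities given the counts.
    Using this as the document model is an assumption of this sketch."""
    total = sum(counts) + sum(alphas)
    return [(x + al) / total for x, al in zip(counts, alphas)]

def kl_divergence(p, q):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    # Toy 4-word vocabulary (hypothetical values).
    probs = [0.25] * 4    # multinomial word probabilities
    alphas = [0.25] * 4   # small alphas make the DCM favor repeated words
    bursty = [4, 0, 0, 0] # one word repeated four times
    spread = [1, 1, 1, 1] # four different words once each

    print("multinomial:", multinomial_log_prob(bursty, probs),
          multinomial_log_prob(spread, probs))
    print("DCM:        ", dcm_log_prob(bursty, alphas),
          dcm_log_prob(spread, alphas))

    # Rank two toy documents against a query by KL(query || document).
    query_model = [0.5, 0.5, 0.0, 0.0]
    docs = {"doc_a": [3, 1, 0, 0], "doc_b": [0, 0, 2, 2]}
    ranking = sorted(
        docs,
        key=lambda d: kl_divergence(query_model, smoothed_model(docs[d], alphas)),
    )
    print("ranking (best first):", ranking)
```

With small parameters the DCM gives the repeated-word document a higher probability than the evenly spread one, while the multinomial does the opposite; this is the burstiness effect the abstract refers to. A lower KL-divergence from the query model to a document model ranks that document higher.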