A Study of the Question Classification Task in Community-Based Q&A Services
Received: 2009-12-21
Revised: 2010-04-17
Published online: 2011-01-25
Funding
National Science and Technology Major Project of China (2009ZX0300400404)
Abstract: One of the main tasks of community-based question answering (cQA) services such as Baidu Zhidao is to classify questions so that the questions users submit can be organized. In practical deployment, a question classification algorithm for cQA must offer high accuracy, low computational cost, and low sensitivity to noisy data. Classification based on the Kullback-Leibler distance (KLD) has shown high accuracy on large-scale text and high-dimensional vector classification tasks. Building on that algorithm and adopting ideas from language modeling, this paper proposes an improved classification algorithm, n-gram KLD. A series of experiments on a large corpus of more than 1 million question-answer pairs shows that n-gram KLD outperforms traditional algorithms on the question classification task, and that its computational complexity and sensitivity to noisy data meet the requirements of the task well.
Keywords: short text classification; Kullback-Leibler distance; language model
Wang Junze, Huang Benxiong, Hu Guang, Wen Jie. A Study of the Question Classification Task in Community-Based Q&A Services[J]. Computer Engineering & Science, 2011, 33(1): 143-149. DOI: 10.3969/j.issn.1007130X.2011.
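The abstract does not give the formulation of n-gram KLD, so the following Python sketch only illustrates the general idea it names: each category gets an n-gram language model, and a question is assigned to the category whose model has the smallest Kullback-Leibler divergence from the question's own n-gram distribution. Everything here is an assumption made for illustration, not the authors' algorithm: the class name `NgramKLDClassifier`, the use of character bigrams, the additive smoothing constant `alpha`, and the direction of the divergence.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text, ignoring whitespace."""
    s = "".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


class NgramKLDClassifier:
    """Hypothetical sketch: nearest category by KL divergence over n-grams."""

    def __init__(self, n=2, alpha=0.1):
        self.n = n            # n-gram order (assumed, not from the paper)
        self.alpha = alpha    # additive smoothing constant (assumed)
        self.counts = {}      # category -> Counter of n-gram counts
        self.totals = {}      # category -> total number of n-grams seen
        self.vocab = set()    # all n-grams seen in training

    def fit(self, questions, labels):
        """Accumulate n-gram counts for each category."""
        for question, label in zip(questions, labels):
            grams = char_ngrams(question, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.totals[label] = self.totals.get(label, 0) + len(grams)
            self.vocab.update(grams)
        return self

    def _category_prob(self, category, gram):
        """Additively smoothed P(gram | category).

        The +1 adds one extra slot so that n-grams never seen in
        training still receive a small nonzero probability.
        """
        slots = len(self.vocab) + 1
        count = self.counts[category][gram]  # Counter gives 0 if missing
        return (count + self.alpha) / (self.totals[category] + self.alpha * slots)

    def divergence(self, question, category):
        """Sum of p * log(p / q) over the question's n-grams, where p is the
        empirical question distribution and q the smoothed category model."""
        grams = Counter(char_ngrams(question, self.n))
        total = sum(grams.values())
        if total == 0:
            return math.inf  # question too short to form any n-gram
        d = 0.0
        for gram, k in grams.items():
            p = k / total
            d += p * math.log(p / self._category_prob(category, gram))
        return d

    def predict(self, question):
        """Return the category with the smallest divergence."""
        return min(self.counts, key=lambda c: self.divergence(question, c))
```

Training on labeled questions with `fit` and calling `predict` on a new question returns the closest category. Because training is a single counting pass and prediction touches only the n-grams of the question, a design of this kind is cheap to compute, which is in the spirit of the low-computation requirement stated in the abstract.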