• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (10): 1971-1976.

• Article

Fast classification of Chinese text with a KNN algorithm using locality-sensitive hashing based on cosine distance

DAI Shangping1, FENG Peng1, LIUSHENG Yingjie1, SHU Hong2

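  1. (1. School of Computer Science, Central China Normal University, Wuhan, Hubei 430079; 2. State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan, Hubei 430079)
  • Received: 2015-07-25 Revised: 2015-09-21 Published: 2015-10-25 Released: 2015-10-25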
  • Funding:

    Project funded by the Wuhan Municipal Government (Grid-based analysis of livable community environments)

A fast KNN algorithm for Chinese text classification based on the LSH of cosine distance

DAI Shangping1,FENG Peng1,LIUSHEN  Yingjie1,SHU Hong2   

  1. (1. School of Computer Science, Central China Normal University, Wuhan 430079; 2. National Key Laboratory of Surveying and Remote Sensing Information Engineering, Wuhan 430079, China)
  • Received: 2015-07-25 Revised: 2015-09-21 Online: 2015-10-25 Published: 2015-10-25

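Abstract:

Text classification is one of the most important research topics in text mining. To overcome the drawback that current distance-based approximate classification algorithms consume a great deal of time on massive data, we propose combining locality-sensitive hashing based on cosine distance with the KNN algorithm to classify Chinese text quickly under TF-IDF. Drawing on the characteristics of text data, we also give several different ways of cascading the hash functions and run experiments with each. In the experiments, boolean vectors are used to avoid repeated visits, so that classification results stay within an acceptable range while classification is much faster than with the original KNN.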

Key words: text classification, locality-sensitive hashing, TF-IDF, KNN, boolean vector

Abstract:

Text classification is one of the most important research topics in text mining. To overcome the high query time cost of distance-based classification algorithms, a K-Nearest Neighbors (KNN) algorithm based on the Locality Sensitive Hashing (LSH) of cosine distance under TF-IDF is proposed, which can classify Chinese text quickly. In addition, by combining the properties of the text data, experiments with different parameters are carried out. In the experiments, boolean vectors are used to avoid duplicate computation. Compared with the original KNN, our algorithm increases classification speed substantially while keeping accuracy within an acceptable range.

Key words: text classification; Locality Sensitive Hashing (LSH); TF-IDF; KNN; boolean vector
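
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the hash family or the cascade, so the code assumes random-hyperplane sign hashing (a standard LSH family for cosine similarity), an AND of `n_bits` hash bits inside each table, and an OR across `n_tables` tables. The class name `CosineLSHKNN`, its parameters, and the toy document vectors are all hypothetical. The `visited` list plays the role of the boolean vector: a training document that collides with the query in several tables is scored only once.

```python
import math
import random
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class CosineLSHKNN:
    """Hypothetical KNN classifier that scores only LSH bucket candidates."""

    def __init__(self, dim, n_tables=8, n_bits=6, seed=0):
        rng = random.Random(seed)
        # Assumed cascade: each table concatenates (ANDs) n_bits sign bits
        # into one bucket key; the tables are combined (ORed) at query time.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.vectors, self.labels = [], []

    @staticmethod
    def _key(table_planes, v):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * x for p, x in zip(plane, v)) >= 0 for plane in table_planes
        )

    def fit(self, vectors, labels):
        for v, y in zip(vectors, labels):
            idx = len(self.vectors)
            self.vectors.append(v)
            self.labels.append(y)
            for planes, table in zip(self.planes, self.tables):
                table[self._key(planes, v)].append(idx)
        return self

    def candidates(self, q):
        # Boolean vector: marks training items already collected, so an item
        # that collides with q in several tables is visited only once.
        visited = [False] * len(self.vectors)
        found = []
        for planes, table in zip(self.planes, self.tables):
            for idx in table.get(self._key(planes, q), ()):
                if not visited[idx]:
                    visited[idx] = True
                    found.append(idx)
        return found

    def predict(self, q, k=3):
        cand = self.candidates(q)
        if not cand:
            return None  # no collision in any table; caller may fall back
        top = sorted(cand, key=lambda i: cosine(q, self.vectors[i]), reverse=True)[:k]
        return Counter(self.labels[i] for i in top).most_common(1)[0][0]


# Toy vectors standing in for TF-IDF weights of four short documents.
docs = [
    [1.0, 0.2, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.3],
    [0.1, 0.0, 0.8, 0.4],
]
topics = ["sports", "sports", "finance", "finance"]
clf = CosineLSHKNN(dim=4).fit(docs, topics)
print(clf.predict(docs[0], k=1))
```

In this sketch, raising `n_bits` makes buckets smaller (fewer candidates, faster but more misses), while raising `n_tables` recovers recall at the cost of more lookups. That trade-off is presumably what the experiments with different cascades and parameters explore.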