• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (10): 1971-1976.

• Articles •

A fast KNN algorithm for Chinese text classification
based on the LSH of cosine distance   

DAI Shangping1,FENG Peng1,LIUSHEN  Yingjie1,SHU Hong2   

  1. (1.School of Computer Science,Central China Normal University,Wuhan 430079;2.National Key Laboratory of Surveying and Remote Sensing I nformation Engineering,Wuhan 430079,China)
  • Received:2015-07-25 Revised:2015-09-21 Online:2015-10-25 Published:2015-10-25

Abstract:

Text classification is one of the most important research topics in text mining. Distance-based classification algorithms suffer from high query time cost. To overcome this drawback, a K-Nearest Neighbors (KNN) algorithm based on the Locality Sensitive Hashing (LSH) of cosine distance under TF-IDF is proposed, which can classify Chinese text quickly. In addition, experiments with different parameters are carried out, taking the properties of the text data into account. In the experiments, boolean vectors are used to avoid redundant computation. Compared with the original KNN, our algorithm increases classification speed while maintaining accuracy.

Key words: text classification; Locality Sensitive Hashing (LSH); TF-IDF; KNN; boolean vector
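
The abstract does not give the exact hashing construction, so the following is only a minimal sketch of the general idea, assuming the common random-hyperplane (sign of random projection) LSH family for cosine distance. Documents are assumed to be already converted to dense TF-IDF vectors. The class name `CosineLSHKNN`, the parameters `n_tables` and `n_bits`, and the brute-force fallback are hypothetical choices for illustration. The `seen` list is one possible reading of the "boolean vectors" mentioned above: it marks candidates that have already been scored, so a document that collides with the query in several hash tables has its cosine similarity computed only once.

```python
import math
import random
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class CosineLSHKNN:
    """KNN classifier that scores only LSH candidates (illustrative sketch)."""

    def __init__(self, dim, n_tables=4, n_bits=6, seed=0):
        rng = random.Random(seed)
        # One independent set of random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.vectors, self.labels = [], []

    def _key(self, planes, v):
        # Each bit records on which side of a hyperplane the vector lies.
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)

    def fit(self, vectors, labels):
        for v, y in zip(vectors, labels):
            idx = len(self.vectors)
            self.vectors.append(v)
            self.labels.append(y)
            for planes, table in zip(self.planes, self.tables):
                table[self._key(planes, v)].append(idx)
        return self

    def predict(self, q, k=3):
        # Assumed role of the boolean vector: one flag per training document,
        # set once the document has been scored against the query.
        seen = [False] * len(self.vectors)
        scored = []
        for planes, table in zip(self.planes, self.tables):
            for idx in table.get(self._key(planes, q), []):
                if not seen[idx]:
                    seen[idx] = True
                    scored.append((cosine(q, self.vectors[idx]), idx))
        if not scored:
            # Hypothetical fallback: no bucket collision, so scan everything.
            scored = [(cosine(q, v), i) for i, v in enumerate(self.vectors)]
        scored.sort(reverse=True)
        votes = Counter(self.labels[i] for _, i in scored[:k])
        return votes.most_common(1)[0][0]


# Toy TF-IDF-like vectors (made up for illustration, not from the paper).
train = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.0, 0.1],
         [0.0, 0.0, 0.9, 0.2], [0.1, 0.0, 0.7, 0.3]]
labels = ["sports", "sports", "finance", "finance"]
clf = CosineLSHKNN(dim=4).fit(train, labels)
print(clf.predict([0.9, 0.1, 0.0, 0.0], k=1))
```

More bits per signature make buckets smaller and queries faster but raise the chance of missing true neighbors; more tables recover recall at the cost of memory. This is the kind of parameter trade-off the experiments in the abstract refer to.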