  • Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• Articles •

A multi-stage classification KNN algorithm based on center vector

LIU Shu-chang (刘述昌), ZHANG Zhong-lin (张忠林)

  1. (School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, Gansu, China)
  • Received: 2015-12-07 Revised: 2016-02-22 Online: 2017-09-25 Published: 2017-09-25
  • Funding: National Natural Science Foundation of China (61662043)

Abstract:

The KNN algorithm has two shortcomings in Chinese text classification: the training samples are unevenly distributed, and the computational overhead at classification time is high. Building on existing improved algorithms, we propose a multi-stage classification KNN algorithm. First, the algorithm applies a density-based idea to adjust the training samples, using sample pruning to bring the sample distribution closer to an ideal uniform state, and computes the class center vector of each category. Second, provided the class center vectors remain accurate, the expensive computation of the classification stage is moved forward into the training of the classifier. Finally, the last stage selects a suitable value of m (the number of preselected candidate categories) and assigns the text to a category according to the nearest-neighbor principle. Experimental results show that, without loss of classification accuracy, the algorithm both reduces computational complexity and significantly improves classification speed.

Key words: text classification, multi-stage classifier, class center vector, K-nearest neighbor
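
The abstract gives no pseudocode, so the following is only a minimal Python sketch of the staged idea it describes, not the authors' implementation. It assumes cosine similarity as the similarity measure, class centers computed as per-class mean vectors, and a simple majority vote among the k nearest neighbors. The density-based sample pruning step is omitted because the abstract does not specify it. The names `class_centers`, `classify`, and `k` are hypothetical; `m` is the number of preselected categories mentioned in the abstract.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_centers(samples):
    """Training step: map each label to the mean vector of its samples."""
    grouped = defaultdict(list)
    for vec, label in samples:
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(x, samples, centers, m=2, k=3):
    """Preselect m categories by center similarity, then run KNN inside them."""
    # Stage 1: keep the m categories whose center vectors are most similar to x.
    candidates = sorted(
        centers, key=lambda label: cosine(x, centers[label]), reverse=True
    )[:m]
    # Stage 2: ordinary KNN restricted to samples of the preselected categories.
    pool = [(cosine(x, vec), label) for vec, label in samples if label in candidates]
    pool.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in pool[:k])
    return votes.most_common(1)[0][0]


# Toy term-weight vectors standing in for document vectors (assumed data).
samples = [
    ([1.0, 0.0, 0.0], "sports"), ([0.9, 0.1, 0.0], "sports"),
    ([0.0, 1.0, 0.0], "finance"), ([0.1, 0.9, 0.0], "finance"),
    ([0.0, 0.0, 1.0], "tech"), ([0.0, 0.1, 0.9], "tech"),
]
centers = class_centers(samples)
print(classify([0.8, 0.2, 0.0], samples, centers, m=2, k=3))  # prints "sports"
```

In this sketch the class centers are computed once at training time, and each query is compared only against the samples of the m preselected categories rather than the whole training set, which is where the saving in classification-time computation would come from.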