• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

An improved KNN short text classification algorithm based on category feature words

HUANG Xian-ying, XIONG Li-yuan, LIU Ying-tao, LI Qin-dong

  1. (College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 400054, China)
  • Received: 2016-05-04 Revised: 2016-06-23 Online: 2018-01-25 Published: 2018-01-25
  • Funding:

    National Natural Science Foundation of China (11547148); Ministry of Education Humanities and Social Sciences Research Youth Fund (15YJC790061); Chongqing Municipal Education Commission Science and Technology Research Project (16SKGH133)

Abstract:

KNN short text classification algorithms improve classification accuracy by expanding the content of short texts, but this expansion reduces classification efficiency. To address this problem, we extract category feature words for each category in the training set using the chi-square statistic (CHI). Based on the similarity between the samples of each category and that category's features, the existing training set is split and refined, so that each category is divided into several training subsets, each containing part of the category's samples. For a given test text, the samples of the training subsets whose category features are most similar to the test text are extracted to reconstruct the training set for that test text. This reduces the number of text pairs that the KNN short text classification algorithm must compare and thereby improves its efficiency. Experimental results show that, compared with the KNN short text classification algorithm based on HowNet semantics, the proposed algorithm improves the efficiency of KNN short text classification by nearly 50% and also improves classification accuracy to some extent.
Key words: short text classification; KNN classification; category feature; HowNet; efficiency
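
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch under stated assumptions. The abstract does not say how a category is split into subsets or which similarity measure is used, so the sketch assumes one training subset per (category, feature word) pair and Jaccard similarity over tokens, and it omits the HowNet-based text expansion. The function names (chi_square_features, split_training_set, classify) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def chi_square_features(docs, labels, top_m=2):
    """Return {category: top_m terms ranked by the chi-square statistic}."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    features = {}
    for cat in set(labels):
        n_cat = sum(1 for y in labels if y == cat)
        scores = {}
        for term in vocab:
            # 2x2 contingency counts of (term present?) x (document in cat?)
            a = sum(1 for s, y in zip(doc_sets, labels) if y == cat and term in s)
            b = sum(1 for s, y in zip(doc_sets, labels) if y != cat and term in s)
            c = n_cat - a
            d = (n - n_cat) - b
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            # Keep only terms positively associated with the category.
            if denom and a * d > c * b:
                scores[term] = n * (a * d - c * b) ** 2 / denom
        # Ties are broken alphabetically so the result is deterministic.
        features[cat] = sorted(scores, key=lambda t: (-scores[t], t))[:top_m]
    return features


def split_training_set(docs, labels, features):
    """Assumed refinement: one subset of sample indices per (category, feature word)."""
    subsets = defaultdict(list)
    for i, (doc, cat) in enumerate(zip(docs, labels)):
        for word in features[cat]:
            if word in doc:
                subsets[(cat, word)].append(i)
    return subsets


def jaccard(a, b):
    """Token-set similarity (an assumption; the paper's measure may differ)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def classify(text, docs, labels, subsets, k=3):
    """KNN over the reduced training set; returns (label, number of comparisons)."""
    words = set(text)
    # Rebuild the training set from subsets whose feature word occurs in the text.
    candidates = sorted(
        {i for (_cat, w), idx in subsets.items() if w in words for i in idx}
    )
    if not candidates:  # no feature word matched: fall back to the full set
        candidates = list(range(len(docs)))
    nearest = sorted(candidates, key=lambda i: -jaccard(text, docs[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0], len(candidates)


# Toy, pre-tokenized data (hypothetical).
docs = [
    ["stock", "market", "rises"],
    ["bank", "raises", "interest", "rate"],
    ["stock", "bank", "report"],
    ["team", "wins", "match"],
    ["player", "scores", "goal", "match"],
    ["team", "signs", "player"],
]
labels = ["finance"] * 3 + ["sports"] * 3

features = chi_square_features(docs, labels)
subsets = split_training_set(docs, labels, features)
label, compared = classify(["stock", "price", "falls"], docs, labels, subsets)
print(label, compared, "of", len(docs))
```

In this toy run the test text is compared with only the training samples that share one of its category feature words instead of with all six samples, which is the source of the efficiency gain the abstract describes. The size of that gain on real data depends on how the subsets are built and is not reproduced here.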