• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

A semi-supervised short text clustering algorithm
based on improved similarity and class-center vector

LI Xiaohong,RAN Hongyan,GONG Jiheng,YAN Li,MA Huifang   

  1. (College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)
  • Received: 2017-05-24 Revised: 2017-09-06 Online: 2018-09-25 Published: 2018-09-25

Abstract:

Based on an analysis of the shortcomings of existing short text clustering algorithms, a semi-supervised short text clustering algorithm based on improved similarity and class-center vectors is proposed. First, the notion of a strong category differentiation word is defined, and the set of strong category differentiation words is constructed from the labeled data. Then, an effective short text similarity measure is designed by combining cosine similarity with a similarity based on strong category differentiation words. Next, each unclassified sample is assigned to a category by calculating the similarity between the sample and each class-center vector. At the same time, the labeled data set and the class-center vectors are updated, and the strong category differentiation words are extracted again. This process is repeated until all of the data has been assigned to categories. Experiments show that, compared with other similar algorithms, the proposed algorithm achieves both higher accuracy and better time efficiency.
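
The iterative procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so several details here are assumptions made purely for illustration and are not the authors' definitions: a word is treated as a strong category differentiation word when at least a fraction `purity` of its occurrences in the labeled data fall in one class, the two similarities are blended linearly with a weight `alpha`, class centers are summed term-count vectors, and the most confident sample is labeled in each round. All function and parameter names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def strong_words(labeled, purity=0.8):
    """ASSUMPTION: a word is 'strong' for a class if at least `purity`
    of its occurrences in the labeled data belong to that class."""
    per_class = defaultdict(Counter)
    total = Counter()
    for tokens, label in labeled:
        per_class[label].update(tokens)
        total.update(tokens)
    return {label: {w for w, c in counts.items() if c / total[w] >= purity}
            for label, counts in per_class.items()}


def class_centers(labeled):
    """ASSUMPTION: a class center is the summed term-count vector of its members."""
    centers = defaultdict(Counter)
    for tokens, label in labeled:
        centers[label].update(tokens)
    return centers


def similarity(tokens, label, center, strong, alpha=0.5):
    """ASSUMPTION: linear blend of cosine similarity and the share of the
    text's distinct words that are strong words of the candidate class."""
    uniq = set(tokens)
    overlap = len(uniq & strong.get(label, set())) / len(uniq) if uniq else 0.0
    return alpha * cosine(Counter(tokens), center) + (1 - alpha) * overlap


def cluster(labeled, unlabeled, batch=1):
    """Label the most confident samples, then recompute the class centers
    and strong words, repeating until no unlabeled sample remains."""
    labeled = list(labeled)
    pending = list(unlabeled)
    while pending:
        strong = strong_words(labeled)
        centers = class_centers(labeled)
        scored = []
        for idx, tokens in enumerate(pending):
            best = max(centers,
                       key=lambda l: similarity(tokens, l, centers[l], strong))
            score = similarity(tokens, best, centers[best], strong)
            scored.append((score, idx, best))
        scored.sort(key=lambda item: item[0], reverse=True)
        chosen = scored[:batch]
        for _, idx, best in chosen:
            labeled.append((pending[idx], best))
        taken = {idx for _, idx, _ in chosen}
        pending = [t for i, t in enumerate(pending) if i not in taken]
    return labeled
```

Each text is passed as a list of tokens, and `cluster` returns the labeled set extended with the newly assigned samples. Labeling only a small batch per round mirrors the abstract's point that the labeled set, the class centers, and the strong words are refreshed as clustering proceeds; a larger `batch` trades that refresh frequency for speed.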
 

Key words: strong category differentiation, similarity, class-center vector, semi-supervised clustering, short text