• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (2): 103-108.

• Articles •

一种高效中文文本聚类算法

马甲林,刘金岭,于长辉   

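  1. (School of Computer Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003)
  • Received: 2012-01-10 Revised: 2012-04-01 Online: 2013-02-25 Published: 2013-02-25
  • Supported by:

    Philosophy and Social Science Project for Universities of the Jiangsu Provincial Department of Education (2012SJD870001); Huai'an Municipal Science and Technology Program (SN1160)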

An efficient algorithm for Chinese text clustering

MA Jialin,LIU Jinling,YU Changhui   

  1. (School of Computer Engineering,Huaiyin Institute of Technology,Huai’an 223003,China)
  • Received:2012-01-10 Revised:2012-04-01 Online:2013-02-25 Published:2013-02-25

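Abstract (translated from the Chinese version):

Text clustering algorithms face the problem that text vectors are high-dimensional and extremely sparse. Most traditional dimension reduction methods extract features by statistical means under the assumption that keywords are mutually independent; such methods tend to ignore the semantic relations of a text within its context, causing substantial loss of text semantics. Using the HowNet knowledge base, multiple weighted lexical chains are constructed by computing semantic class similarity, and, according to the weights, the two chains with the largest and second-largest weights are selected to form the keyword sequence that represents the text. On this basis, a text clustering algorithm based on topic lexical chains, TCABTLC, is proposed. It not only resolves the low running efficiency of clustering algorithms caused by high-dimensional and sparse text vectors, but also achieves good clustering results. Experiments show that the time efficiency of the algorithm is greatly improved while good accuracy is maintained.

Keywords: HowNet, vector model, lexical chain, text clustering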

Abstract:

Text clustering algorithms must cope with text vectors that are high-dimensional and extremely sparse. Traditional dimension reduction methods mostly extract text features statistically, assuming that the keywords are independent of one another. They often ignore the semantic relations of a text in its context, which leads to considerable loss of text semantics. In this paper, HowNet is used to compute semantic class similarity, from which multiple weighted lexical chains are constructed. According to the weights, the two lexical chains with the largest and second-largest weights are chosen to form the keyword sequence that represents the text. A text clustering algorithm based on theme lexical chains (TCABTLC) is then proposed. It addresses the low operating efficiency of clustering algorithms caused by high-dimensional, sparse text vectors, and also obtains good clustering results. The experiments show that the time efficiency of the clustering algorithm is greatly improved while good accuracy is maintained.

Key words: HowNet; vector model; lexical chain; text clustering
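
The chain-selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' TCABTLC implementation. It assumes a toy word-to-class table (`TOY_CLASSES`) in place of HowNet, a hypothetical binary similarity function (`toy_similarity`), a greedy chain-building rule, and chain weights equal to the total occurrence count of the words in the chain; the abstract does not specify any of these details.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a HowNet lookup: word -> semantic class label.
TOY_CLASSES: Dict[str, str] = {
    "bank": "finance", "loan": "finance", "interest": "finance",
    "river": "nature", "water": "nature",
    "goal": "sport",
}


def toy_similarity(a: str, b: str) -> float:
    """Return 1.0 when both words share a toy semantic class, else 0.0."""
    ca, cb = TOY_CLASSES.get(a), TOY_CLASSES.get(b)
    return 1.0 if ca is not None and ca == cb else 0.0


def build_chains(
    words: List[str],
    sim: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Tuple[List[str], float]]:
    """Greedily group distinct words into weighted lexical chains.

    A word joins the first chain that contains a member whose similarity
    to it reaches the threshold; otherwise it starts a new chain.
    The chain weight is the total occurrence count of its words
    (an assumed weighting, not taken from the paper).
    """
    counts = Counter(words)
    chains: List[List[str]] = []
    for word in counts:  # iterates in order of first occurrence
        for chain in chains:
            if any(sim(word, member) >= threshold for member in chain):
                chain.append(word)
                break
        else:
            chains.append([word])
    return [(chain, float(sum(counts[w] for w in chain))) for chain in chains]


def top_two_keywords(
    words: List[str],
    sim: Callable[[str, str], float] = toy_similarity,
    threshold: float = 0.5,
) -> List[str]:
    """Concatenate the words of the two heaviest chains into a keyword sequence."""
    chains = build_chains(words, sim, threshold)
    chains.sort(key=lambda pair: pair[1], reverse=True)  # stable sort
    return [word for chain, _ in chains[:2] for word in chain]


tokens = ["bank", "loan", "river", "bank", "interest", "water", "goal"]
keywords = top_two_keywords(tokens)
print(keywords)
```

In this sketch the document is reduced from six distinct words to the five words of its two heaviest chains. A clustering step would then compare documents on these short keyword sequences rather than on full high-dimensional term vectors.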