• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2006, Vol. 28 ›› Issue (1): 122-124.

• Paper •

Research on Chinese Word Clustering

胡和平 曾庆锐 路松峰   

  • Online: 2006-01-01 Published: 2010-05-20

Abstract:

Word clustering is a fundamental step in automatic language processing. The main obstacle in Chinese word clustering is that training data are scarce and of low quality, which degrades the clustering results. To address this, this paper presents a word clustering algorithm oriented toward Chinese. First, the similarity of the context distributions of words is used as the distance measure. Second, the limitations of clustering Chinese words by the distance measure alone are analyzed, and the concept of Word-Near-Space is put forward; clustering on the basis of Word-Near-Space relies on the intrinsic semantics of words and requires neither the number nor the size of the classes to be specified in advance. Finally, the clustering result is used as the basis for recomputing the context similarity, and the above steps are repeated until the whole algorithm converges, consistent with the EM criterion, which clearly improves the clustering result. Experiments show that the algorithm effectively overcomes the problems of quantity and quality of Chinese training data and yields good clustering results.

Key words: Chinese word, word, clustering, Word-Near-Space, EM algorithm
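
The first and last steps named in the abstract (a context-based similarity, then recomputing it from the clustering result) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not give the similarity formula, so cosine similarity over neighbour counts is swapped in; the window size, the toy corpus, the function names `context_vectors` and `cosine`, and the cluster label `C_verb` are all hypothetical. The Word-Near-Space step is not defined in the abstract and is not implemented here.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1, label=None):
    """Count the context of every word within `window` positions.

    `label` optionally maps a context word to a cluster id, so that
    contexts are counted over clusters instead of raw words.
    (Hypothetical helper, not taken from the paper.)
    """
    label = label or {}
    vectors = defaultdict(Counter)
    for words in sentences:
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    neighbour = words[j]
                    vectors[word][label.get(neighbour, neighbour)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy, pre-segmented corpus, made up for illustration.
corpus = [
    ["我", "喜欢", "苹果"],
    ["我", "喜欢", "香蕉"],
    ["他", "讨厌", "苹果"],
    ["他", "讨厌", "香蕉"],
]

# Pass 1: contexts are raw words, so 我 and 他 share no context at all.
raw = context_vectors(corpus)
sim_raw = cosine(raw["我"], raw["他"])

# Pass 2: suppose an earlier clustering pass grouped the two verbs together;
# counting contexts over cluster ids makes the two pronouns comparable.
clusters = {"喜欢": "C_verb", "讨厌": "C_verb"}
relabelled = context_vectors(corpus, label=clusters)
sim_relabelled = cosine(relabelled["我"], relabelled["他"])
```

On this toy corpus the similarity between 我 and 他 rises from 0 to 1 once the verb contexts are replaced by their shared cluster id, which is one plausible reading of why feeding the clustering result back into the similarity computation helps when training data are sparse.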