• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (7): 149-155.

• Papers •

A Uyghur Text Clustering Algorithm Combining GAAC and K-means

TURDI Tohti, AHMATJAN Ablat, MUYASSAR Aniwar, ASKAR Hamdulla

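  1. (School of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046)
  • Received: 2012-04-27 Revised: 2012-10-16 Published: 2013-07-25 Released: 2013-07-25
  • Funding:

    National Natural Science Foundation of China (61063022, 61262062, 61163033); High-Tech Research and Development Program of Xinjiang Uyghur Autonomous Region (201212124); Key Project of the Scientific Research Program of Higher Education Institutions of Xinjiang Uyghur Autonomous Region (XJEDU2012I11); Program for New Century Excellent Talents in University of the Ministry of Education (NCET100969)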

Combined algorithm of GAAC and K-means for Uyghur text clustering

TURDI Tohti, AHMATJAN Ablat, MUYASSAR Aniwar, ASKAR Hamdulla

  1. (School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)
  • Received: 2012-04-27 Revised: 2012-10-16 Online: 2013-07-25 Published: 2013-07-25

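Abstract:

This paper introduces the ideas behind the K-means and GAAC clustering algorithms, and the effect of two feature extraction methods on Uyghur text representation and clustering efficiency. On a relatively large-scale text corpus, Uyghur text clustering experiments are run with K-means and with GAAC, and their performance is compared. To address the classical K-means algorithm's excessive dependence on the initial cluster centers and its instability, as well as the high computational complexity of GAAC, a Uyghur text clustering algorithm combining GAAC and K-means is proposed. The algorithm clusters in two steps: first, a GAAC module obtains optimal initial cluster centers from a small set of texts; then a K-means module quickly clusters the large set of texts. Experimental results show that the new algorithm achieves significant improvements in both clustering accuracy and time complexity.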

Keywords: Uyghur, text clustering, K-means, GAAC, combined algorithm

Abstract:

This paper introduces the K-means and GAAC clustering methods and examines the impact of two feature extraction methods on Uyghur text representation and clustering efficiency. Using a large-scale text corpus, Uyghur text clustering experiments are carried out with both K-means and GAAC, and their performance is compared. Because K-means depends too heavily on the initial cluster centers and is unstable, while GAAC has high computational complexity, a Uyghur text clustering algorithm combining GAAC and K-means is proposed. The proposed algorithm has two steps. First, GAAC obtains the optimal initial cluster centers from a small set of texts. Second, K-means quickly clusters the large set of texts. Experimental results show that the proposed algorithm achieves significant improvements in both clustering accuracy and time complexity.

Key words: Uyghur text; text clustering; K-means; GAAC; combined algorithm
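
The two-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the feature extraction, how the small text set is chosen, or the similarity measure. The sketch assumes documents are already represented as numeric vectors (for example TF-IDF weights), draws the small set by random sampling, uses the common group-average similarity on unit-length vectors for GAAC, and runs plain Euclidean K-means. All function names (`gaac_centroids`, `kmeans`, `gaac_kmeans`) and the `sample_size` and `seed` parameters are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def gaac_centroids(docs, k):
    """Group-average agglomerative clustering on unit-length vectors.

    Repeatedly merges the pair of clusters whose union has the highest
    average pairwise similarity, until k clusters remain, and returns
    the mean vector of each cluster.
    """
    clusters = [(list(d), 1) for d in docs]  # (sum of member vectors, size)
    while len(clusters) > k:
        best_sim, best = -math.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = [a + b for a, b in zip(clusters[i][0], clusters[j][0])]
                n = clusters[i][1] + clusters[j][1]
                # Average dot product over all pairs of distinct documents
                # in the merged cluster (valid only for unit-length vectors).
                sim = (sum(x * x for x in s) - n) / (n * (n - 1))
                if sim > best_sim:
                    best_sim, best = sim, (i, j, s, n)
        i, j, s, n = best
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((s, n))
    return [[x / n for x in s] for s, n in clusters]


def kmeans(docs, init_centroids, max_iter=100):
    """Plain K-means (squared Euclidean distance) from given initial centers."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(d, centroids[c])))
            for d in docs
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def gaac_kmeans(docs, k, sample_size, seed=0):
    """Step 1: GAAC on a small random sample yields k initial centers.
    Step 2: K-means clusters the full collection from those centers."""
    docs = [normalize(d) for d in docs]
    sample = random.Random(seed).sample(docs, min(sample_size, len(docs)))
    return kmeans(docs, gaac_centroids(sample, k))
```

Calling `gaac_kmeans(vectors, k=5, sample_size=200)` returns one cluster label per document together with the final centers. The pairwise search in `gaac_centroids` is the expensive part, which is why it is confined to the sample, while the cheaper K-means pass handles the full collection; `sample_size` should be at least `k` so that GAAC can return `k` centers.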