一种结合GAAC和Kmeans的维吾尔文文本聚类算法

J4 ›› 2013, Vol. 35 ›› Issue (7): 149-155.

一种结合GAAC和Kmeans的维吾尔文文本聚类算法

吐尔地·托合提，艾海麦提江·阿布来提，米也塞·艾尼玩，艾斯卡尔·艾木都拉

(新疆大学信息科学与工程学院，新疆乌鲁木齐 830046)

收稿日期:2012-04-27 修回日期:2012-10-16 出版日期:2013-07-25 发布日期:2013-07-25
基金资助:
国家自然科学基金资助项目（61063022，61262062，61163033）；新疆维吾尔自治区高技术研究发展计划项目（201212124）；新疆维吾尔自治区高校科研计划重点项目（XJEDU2012I11）；教育部新世纪优秀人才支持计划资助项目（NCET100969）

Combined algorithm of GAAC and
K-means for Uyghur text clustering

TURDI Tohti，AHMATJAN Ablat，MUYASSAR Aniwar，ASKAR Hamdulla

(School of Information Science and Engineering,Xinjiang University,Urumqi 830046,China)

Received:2012-04-27 Revised:2012-10-16 Online:2013-07-25 Published:2013-07-25

Abstract:

This paper introduces the K-means and GAAC (group-average agglomerative clustering) algorithms and examines how two feature extraction methods affect Uyghur text representation and clustering efficiency. Using a relatively large text corpus, Uyghur text clustering experiments were carried out with both K-means and GAAC, and their performance was compared. To address the classic K-means algorithm's excessive dependence on the initial cluster centers and its resulting instability, as well as the high computational complexity of GAAC, the paper proposes a Uyghur text clustering algorithm that combines GAAC and K-means. The algorithm completes clustering in two steps: first, a GAAC module derives the optimal initial cluster centers from a small set of texts; second, a K-means module quickly clusters the large text set. Experimental results show that the new algorithm achieves a significant improvement in both clustering accuracy and time complexity.

Key words: Uyghur text; text clustering; K-means; GAAC; combined algorithm
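The two-step procedure described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: documents are assumed to be already converted to dense term-weight vectors (the paper's feature extraction is not reproduced here), similarity is assumed to be cosine, GAAC follows the common textbook definition (merge the pair of clusters whose union has the highest average pairwise similarity), and the function names `gaac_seeds`, `kmeans`, and `gaac_kmeans`, along with the `sample_size` and `seed` parameters, are hypothetical.

```python
import math
import random
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gaac_seeds(sample, k):
    """Step 1 (assumed form of GAAC): merge the sample bottom-up until k
    clusters remain, then return the centroid of each cluster."""
    sim = [[cosine(a, b) for b in sample] for a in sample]
    clusters = [[i] for i in range(len(sample))]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged = clusters[i] + clusters[j]
            pairs = list(combinations(merged, 2))
            # Group-average criterion: mean similarity over all pairs in the union.
            score = sum(sim[a][b] for a, b in pairs) / len(pairs)
            if best is None or score > best[0]:
                best = (score, i, j)
        _, i, j = best  # i < j, so deleting j leaves index i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid([sample[d] for d in members]) for members in clusters]


def kmeans(docs, centers, max_iter=100):
    """Step 2: K-means over all documents, started from the given centers."""
    centers = [list(c) for c in centers]  # do not mutate the caller's seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centers)), key=lambda c: cosine(doc, centers[c]))
            for doc in docs
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [doc for doc, lab in zip(docs, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels, centers


def gaac_kmeans(docs, k, sample_size, seed=0):
    """Seed K-means with GAAC centroids computed on a small random sample."""
    if not 1 <= k <= min(sample_size, len(docs)):
        raise ValueError("k must be between 1 and the sample size")
    sample = random.Random(seed).sample(docs, min(sample_size, len(docs)))
    return kmeans(docs, gaac_seeds(sample, k))
```

Calling `gaac_kmeans(docs, k, sample_size)` returns one cluster label per document together with the final centers. The costly agglomerative step touches only `sample_size` documents, while the full collection is processed by the much cheaper K-means iterations, which is the trade-off the abstract describes.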

TURDI Tohti, AHMATJAN Ablat, MUYASSAR Aniwar, ASKAR Hamdulla. Combined algorithm of GAAC and K-means for Uyghur text clustering[J]. J4, 2013, 35(7): 149-155.
