An efficient algorithm for Chinese text clustering

J4, 2013, Vol. 35, Issue (2): 103-108.

MA Jialin, LIU Jinling, YU Changhui

(School of Computer Engineering, Huaiyin Institute of Technology, Huai'an 223003, Jiangsu, China)

Received: 2012-01-10 Revised: 2012-04-01 Online: 2013-02-25 Published: 2013-02-25
Supported by:
Philosophy and Social Science Project for Universities of the Jiangsu Provincial Department of Education (2012SJD870001); Huai'an Science and Technology Program (SN1160)

Abstract

Text clustering algorithms must cope with text vectors that are both high-dimensional and extremely sparse. Most traditional dimensionality reduction methods extract features statistically under the assumption that keywords are mutually independent. Such methods often ignore the semantic relations a text carries in its context, so much of the text's semantics is lost. In this paper, the HowNet knowledge base is used to compute semantic class similarity and to construct multiple weighted lexical chains. The two chains with the largest and second-largest weights are then selected to form the keyword sequence that represents the text. On this basis, a text clustering algorithm based on topic lexical chains, TCABTLC, is proposed. It addresses the low running efficiency that high-dimensional, sparse text vectors impose on clustering algorithms, and it also yields good clustering results. Experiments show that the time efficiency of the clustering algorithm improves greatly while good accuracy is maintained.

Key words: HowNet; vector model; lexical chain; text clustering
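
The keyword-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' TCABTLC implementation. The abstract does not give the similarity formula, the chain-building rule, or the weighting scheme, so the following are all assumptions made for illustration: the `TOY_SIMILARITY` table (a made-up stand-in for a HowNet-based similarity), the greedy chain construction, the `threshold` value, and the chain weight taken as the total frequency of the chain's words.

```python
from collections import Counter

# Toy stand-in for a HowNet-based similarity; pairs and scores are made up.
TOY_SIMILARITY = {
    frozenset(("computer", "software")): 0.9,
    frozenset(("software", "program")): 0.8,
    frozenset(("bank", "loan")): 0.85,
    frozenset(("loan", "interest")): 0.7,
}


def similarity(a, b):
    """Return an assumed semantic similarity in [0, 1] for two words."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def build_lexical_chains(words, threshold=0.6):
    """Greedily group distinct words into chains of related words.

    Returns a list of (chain, weight) pairs. The weight is an assumption:
    the total number of occurrences of the chain's words in the text.
    """
    counts = Counter(words)
    chains = []  # each chain is a list of distinct words
    for word in dict.fromkeys(words):  # distinct words, first-occurrence order
        best_chain, best_score = None, threshold
        for chain in chains:
            score = max(similarity(word, member) for member in chain)
            if score >= best_score:
                best_chain, best_score = chain, score
        if best_chain is None:
            chains.append([word])  # no chain is similar enough: start a new one
        else:
            best_chain.append(word)
    return [(chain, sum(counts[w] for w in chain)) for chain in chains]


def topic_keywords(words, top_k=2, threshold=0.6):
    """Keep the words of the top_k heaviest chains as the text's keywords."""
    weighted = build_lexical_chains(words, threshold)
    weighted.sort(key=lambda item: item[1], reverse=True)
    return [w for chain, _ in weighted[:top_k] for w in chain]


# A pre-segmented toy document (word segmentation is outside this sketch).
doc = ["computer", "software", "bank", "program", "software",
       "loan", "weather", "computer", "interest"]
keywords = topic_keywords(doc)
print(keywords)
```

On this toy input the two heaviest chains are the computing chain (weight 5) and the banking chain (weight 3), so the isolated word `weather` is dropped and the text is represented by six keywords instead of seven distinct words. A clustering algorithm would then run on these shorter keyword sequences rather than on the full sparse term vectors.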

MA Jialin, LIU Jinling, YU Changhui. An efficient algorithm for Chinese text clustering[J]. J4, 2013, 35(2): 103-108.
