A semi-supervised short text clustering algorithm based on improved similarity and class-center vector

Computer Engineering & Science

Artificial Intelligence and Data Mining

LI Xiaohong, RAN Hongyan, GONG Jiheng, YAN Li, MA Huifang

(College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China)

Received: 2017-05-24 Revised: 2017-09-06 Online: 2018-09-25 Published: 2018-09-25
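Supported by:
National Natural Science Foundation of China (61163039); Gansu Province Youth Science and Technology Fund (1606RJYA269, 145RJYA259); Scientific Research Project of Gansu Higher Education Institutions (2015A008); Northwest Normal University Young Teachers' Research Capacity Improvement Program (NWNULKQN145, NWNULKQN1620)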


Abstract:

To address the shortcomings of existing short text clustering algorithms, this paper proposes a semi-supervised short text clustering algorithm based on improved similarity and class-center vectors. First, strong category differentiation words are defined, and the category information of the labeled data is used to extract and construct the set of such words. The cosine similarity based on the initial features is then fused with the similarity based on strong category differentiation words, yielding an improved short text similarity measure. Next, each unclassified sample is assigned to a category by computing its similarity to the class-center vectors. At the same time, the labeled data set and the class-center vectors are updated, and the strong category differentiation words are extracted again. This process is repeated until every sample has been assigned a category. Experiments show that, compared with other similar algorithms, the proposed algorithm achieves higher clustering accuracy and better time efficiency.

Key words: strong category differentiation, similarity, class-center vector, semi-supervised clustering, short text
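
The abstract gives no formulas, so the loop it describes can only be illustrated loosely. The Python sketch below is one possible reading, not the authors' method. Several details are assumptions made purely for illustration: the rule that a term is "strong" for a class when at least `ratio` of its labeled weight falls in that class, the linear fusion weight `alpha`, the choice to label the `batch` most confident samples per round, and all function names. It also assumes every class has at least one labeled sample at the start.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

def strong_words(X, y, n_classes, ratio=0.8):
    """Boolean mask (n_classes, n_terms) of assumed 'strong' terms.

    Assumed rule (not from the paper): term j is strong for class c when
    at least `ratio` of its total weight in the labeled data lies in c.
    """
    mass = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        mass[c] = X[y == c].sum(axis=0)
    total = mass.sum(axis=0)
    share = np.divide(mass, total, out=np.zeros_like(mass), where=total > 0)
    return share >= ratio

def fused_similarity(x, center, mask, alpha=0.5):
    """Assumed linear fusion of full-feature and strong-word cosine similarity."""
    base = cosine(x, center)
    strong = cosine(x * mask, center * mask)
    return alpha * base + (1 - alpha) * strong

def cluster(X, y, n_classes, batch=1, ratio=0.8, alpha=0.5):
    """Iteratively label samples marked -1 in y; returns the completed labels."""
    y = y.copy()
    while (y == -1).any():
        labeled = y != -1
        Xl, yl = X[labeled], y[labeled]
        # Recompute class centers and strong words from the current labeled set.
        centers = np.vstack([Xl[yl == c].mean(axis=0) for c in range(n_classes)])
        masks = strong_words(Xl, yl, n_classes, ratio)
        unl = np.flatnonzero(y == -1)
        sims = np.array([[fused_similarity(X[i], centers[c], masks[c], alpha)
                          for c in range(n_classes)] for i in unl])
        # Label only the most confident samples this round (assumed policy).
        for k in np.argsort(-sims.max(axis=1))[:batch]:
            y[unl[k]] = int(sims[k].argmax())
    return y
```

Called on a term-weight matrix `X` (one row per short text) and a label vector `y` with `-1` for unlabeled rows, `cluster(X, y, n_classes)` returns a label for every row. Recomputing the centers and the strong-word masks inside the loop mirrors the update step described in the abstract; a larger `batch` trades that per-round refinement for fewer iterations.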

LI Xiaohong, RAN Hongyan, GONG Jiheng, YAN Li, MA Huifang. A semi-supervised short text clustering algorithm based on improved similarity and class-center vector[J]. Computer Engineering & Science.


