Microblogging opinion analysis based on an improved K-means algorithm

Computer Engineering & Science

XIE Xiu-juan1, LI Xiang-ju1, MO Ling-fei2

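(1. Department of Computer Engineering, Southeast University Chengxian College, Nanjing 210000, Jiangsu, China; 2. School of Instrument Science and Engineering, Southeast University, Nanjing 210000, Jiangsu, China)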

Received: 2016-02-22 Revised: 2016-06-16 Online: 2018-01-25 Published: 2018-01-25
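Funding:
Philosophy and Social Science Fund of Jiangsu Higher Education Institutions (2016SJD880186); Jiangsu Province Modern Educational Technology Research Project (2016-R-46509); National Science and Technology Support Program of the 12th Five-Year Plan (2013BAJ05B02-2)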

Abstract:

Selecting isolated points as initial cluster centers tends to trap K-means clustering in a local optimum. To avoid this, we propose a density-based method for choosing the initial cluster centers of K-means. The method first computes the average similarity between each data object and all other objects, and treats the objects whose average similarity exceeds a fixed threshold as core objects. It then selects, from among the core objects, those that are least similar to one another as the initial cluster centers. Using a self-built crawler for Sina Microblog, we collect several thousand posts from each of several categories. After word segmentation, preprocessing, and term-weight calculation, we cluster the posts with the improved K-means algorithm. Compared with traditional K-means, the improved algorithm yields more stable precision and recall and a shorter average clustering time. The experimental results show that the improved algorithm achieves higher accuracy and stability in microblog clustering, which helps in discovering hot public-opinion topics in large volumes of microblog data.

Key words: microblog, clustering center, K-means clustering algorithm, density
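
The center-selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the similarity measure, the threshold value, or how "least similar to one another" is resolved, so cosine similarity, the caller-supplied `threshold`, and the greedy farthest-first selection are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (assumed measure)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = X / norms
    return unit @ unit.T


def select_initial_centers(X, k, threshold):
    """Return the row indices of k initial centers chosen among core objects.

    A row is a core object when its average similarity to all other rows
    is higher than `threshold` (the value itself is an assumption).
    """
    n = X.shape[0]
    sim = cosine_similarity_matrix(X)
    # Average similarity of each object to every *other* object.
    avg_sim = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)
    core = np.flatnonzero(avg_sim > threshold)
    if len(core) < k:
        raise ValueError("fewer core objects than clusters; lower the threshold")

    # Assumption: start from the densest core object, then greedily add the
    # core object whose highest similarity to the chosen centers is lowest.
    chosen = [core[np.argmax(avg_sim[core])]]
    while len(chosen) < k:
        remaining = np.array([i for i in core if i not in chosen])
        max_sim_to_chosen = sim[np.ix_(remaining, chosen)].max(axis=1)
        chosen.append(remaining[np.argmin(max_sim_to_chosen)])
    return np.array(chosen)


# Toy term-weight vectors: two tight groups plus one isolated point (row 6).
X = np.array([
    [1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
])
centers = select_initial_centers(X, k=2, threshold=0.2)
print(centers)
```

In this toy input the isolated row has zero average similarity, so it falls below the threshold and cannot become a center. The rows `X[centers]` could then serve as starting centroids for any K-means implementation that accepts explicit initial centers.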

XIE Xiu-juan, LI Xiang-ju, MO Ling-fei. Microblogging opinion analysis based on an improved K-means algorithm[J]. Computer Engineering & Science.
