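• Official journal of the China Computer Federation
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science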

• Article •

Research on microblog public opinion analysis based on an improved K-means algorithm

谢修娟¹, 李香菊¹, 莫凌飞²

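  1. Department of Computer Engineering, Southeast University Chengxian College, Nanjing, Jiangsu 210000
  2. School of Instrument Science and Engineering, Southeast University, Nanjing, Jiangsu 210000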
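  • Received: 2016-02-22 Revised: 2016-06-16 Online: 2018-01-25 Published: 2018-01-25
  • Funding:

    Jiangsu Universities Philosophy and Social Science Fund (2016SJD880186); Jiangsu Province Modern Educational Technology Research Project (2016-R-46509); National Science and Technology Support Program of the 12th Five-Year Plan (2013BAJ05B02-2)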

Microblogging opinion analysis based on an improved K-means algorithm

XIE Xiu-juan¹, LI Xiang-ju¹, MO Ling-fei²

  1. Department of Computer Engineering, Southeast University Chengxian College, Nanjing 210000, China
  2. School of Instrument Science and Engineering, Southeast University, Nanjing 210000, China
  • Received: 2016-02-22 Revised: 2016-06-16 Online: 2018-01-25 Published: 2018-01-25

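Abstract:

Selecting an isolated point as an initial cluster center easily traps the clustering result in a local optimum. To avoid this weakness, this paper proposes a density-based method for selecting the initial cluster centers of K-means (a clustering algorithm). The method first computes the average similarity between each data object and all other data objects, treats the objects whose average similarity exceeds a fixed threshold as core objects, and then selects the core objects that are least similar to one another as the initial cluster centers. Using a self-built Sina Weibo crawler, several thousand posts were collected for each of several categories. After word segmentation, preprocessing, and weight calculation, the posts were clustered with the improved K-means algorithm; its precision and recall were more stable than those of traditional K-means, and the average clustering time was also shorter. The experimental results show that the improved algorithm achieves higher accuracy and stability in microblog clustering and helps uncover trending public opinion in large volumes of microblog data.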
 

Keywords: microblog, cluster center, K-means clustering algorithm, density

Abstract:

When an isolated point is selected as an initial cluster center, K-means (a clustering algorithm) can easily converge to a local optimum. To avoid this, we propose a density-based method for selecting the initial cluster centers. The method first calculates the average similarity between each data object and all the others, and treats the objects whose average similarity exceeds a fixed threshold as core objects. The core objects that are least similar to one another are then taken as the initial cluster centers. We build a crawler for Sina Weibo and collect several thousand posts in each of several categories. After word segmentation, preprocessing, and weight calculation, we cluster the posts with the improved K-means algorithm. Compared with the traditional K-means algorithm, our method yields more stable precision and recall, and the average clustering time is also reduced. Experimental results show that the improved algorithm achieves higher accuracy and better stability in microblog clustering, and can help discover trending public opinion in large volumes of microblog data.
 
 

Key words: microblog, clustering center, K-means clustering algorithm, density
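
The initial-center selection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not name the similarity measure, the seeding rule, or the exact procedure for picking the mutually least similar core objects, so the sketch assumes cosine similarity, starts from the densest core object, and then greedily adds the core object whose highest similarity to the already chosen centers is lowest. The names `cosine_similarity` and `select_initial_centers` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_initial_centers(vectors, k, threshold):
    """Return the indices of k initial cluster centers.

    1. Compute each object's average similarity to all other objects.
    2. Keep objects whose average similarity exceeds `threshold` (core objects).
    3. Among core objects, pick k that are mutually dissimilar. The greedy
       rule used here is an assumption, not taken from the paper.
    """
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two objects")

    # Full pairwise similarity matrix, O(n^2) time and memory.
    sim = [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]

    # Average similarity to the other n - 1 objects (self-similarity excluded).
    avg = [(sum(sim[i]) - sim[i][i]) / (n - 1) for i in range(n)]

    # Core objects lie in dense regions; isolated points are filtered out.
    core = [i for i in range(n) if avg[i] > threshold]
    if len(core) < k:
        raise ValueError("fewer core objects than clusters; lower the threshold")

    # Assumption: seed with the densest core object.
    centers = [max(core, key=lambda i: avg[i])]
    while len(centers) < k:
        candidates = [i for i in core if i not in centers]
        # Add the candidate least similar to its closest chosen center.
        nxt = min(candidates, key=lambda i: max(sim[i][c] for c in centers))
        centers.append(nxt)
    return centers
```

The returned indices identify the rows to use as starting centers for an ordinary K-means run. With scikit-learn, for instance, the corresponding vectors could be passed as an array to the `init` parameter of `KMeans` together with `n_init=1`. The threshold has to be tuned per data set, and the pairwise similarity matrix may become a bottleneck for large collections.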