• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2008, Vol. 30 ›› Issue (7): 30-32.

• Article

Research on a Web User Clustering Algorithm Based on the PLSA Model

Yu Hui (俞辉)

  • Online:2008-07-01 Published:2010-05-22

Abstract:

With the rapid growth of web content on the Internet, clustering the browsing records in web logs can improve the efficiency of information search and personalized services. Drawing on information theory, both local and global weights are taken into account when computing the entries of the session-page matrix. Probabilistic latent semantic analysis (PLSA) is then used to transform the conditional probability of the latent variable Z with respect to page P into the conditional probability of Z with respect to session S, and the transformed probabilities serve as the basis for similarity calculation in the clustering step. The distance-based k-medoids algorithm is adopted to further improve clustering accuracy. Experimental results verify both the validity and the limitations of the algorithm.

Key words: web log, web user, probabilistic latent semantic analysis (PLSA), clustering
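
The abstract names three steps (session-page weighting, PLSA, k-medoids) without giving formulas, so the following Python sketch is only one plausible reading, not the paper's implementation. It assumes a log-entropy scheme for the local and global weights, the symmetric PLSA parameterization fitted by EM with P(z|s) obtained by Bayes' rule, Euclidean distance between the P(z|s) vectors, and a simple alternating k-medoids update. The function names `log_entropy_weights`, `plsa`, and `k_medoids`, and the toy `counts` matrix, are hypothetical.

```python
import numpy as np

def log_entropy_weights(counts):
    """Weight a session-page count matrix as local * global (assumed log-entropy scheme)."""
    counts = np.asarray(counts, dtype=float)
    n_sessions = counts.shape[0]              # must be > 1 so that log(n_sessions) != 0
    local = np.log1p(counts)                  # local weight: damped hit count
    col_sums = counts.sum(axis=0, keepdims=True)
    p = np.divide(counts, col_sums, out=np.zeros_like(counts), where=col_sums > 0)
    safe = np.where(p > 0, p, 1.0)            # log(1) = 0, so zero entries contribute nothing
    global_w = 1.0 + (p * np.log(safe)).sum(axis=0) / np.log(n_sessions)
    return local * global_w                   # global weight = 1 - normalized page entropy

def plsa(weights, n_topics, n_iter=50, seed=0):
    """Fit symmetric PLSA by EM; return P(z|s) with shape (sessions, topics) and P(p|z)."""
    rng = np.random.default_rng(seed)
    n_s, n_p = weights.shape
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_s_z = rng.random((n_topics, n_s))
    p_s_z /= p_s_z.sum(axis=1, keepdims=True)
    p_p_z = rng.random((n_topics, n_p))
    p_p_z /= p_p_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z | s, p) is proportional to P(z) P(s|z) P(p|z)
        joint = p_z[:, None, None] * p_s_z[:, :, None] * p_p_z[:, None, :]
        post = joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-12)
        # M-step: re-estimate parameters from weighted posteriors
        expected = weights[None, :, :] * post
        p_s_z = expected.sum(axis=2)
        p_p_z = expected.sum(axis=1)
        p_z = p_s_z.sum(axis=1)
        p_s_z /= np.maximum(p_s_z.sum(axis=1, keepdims=True), 1e-12)
        p_p_z /= np.maximum(p_p_z.sum(axis=1, keepdims=True), 1e-12)
        p_z /= max(p_z.sum(), 1e-12)
    # Bayes' rule: P(z|s) is proportional to P(s|z) P(z)
    p_z_s = (p_s_z * p_z[:, None]).T
    p_z_s /= np.maximum(p_z_s.sum(axis=1, keepdims=True), 1e-12)
    return p_z_s, p_p_z

def k_medoids(points, k, n_iter=100, seed=0):
    """Alternating k-medoids on Euclidean distances; returns (medoid indices, labels)."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = rng.choice(len(points), size=k, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)      # assign each session to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:                          # keep the old medoid if a cluster is empty
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, dist[:, medoids].argmin(axis=1)

# Toy session-page hit counts (rows: sessions, columns: pages); purely illustrative.
counts = np.array([[5, 3, 0, 0],
                   [4, 2, 0, 0],
                   [0, 0, 6, 2],
                   [0, 0, 3, 4]])
weights = log_entropy_weights(counts)
p_z_s, p_p_z = plsa(weights, n_topics=2)
medoids, labels = k_medoids(p_z_s, k=2)
```

Clustering on P(z|s) rather than on raw page vectors means two sessions can be judged similar even when they share few pages, provided those pages load on the same latent factor. A medoid is always an actual session, which keeps each cluster representative interpretable.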