• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2011, Vol. 33 ›› Issue (4): 115-120.

• Papers •

基于Hadoop的搜索引擎用户行为分析

王振宇1,郭力 2   

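  (1. School of Software Engineering, South China University of Technology, Guangzhou 510006, Guangdong; 2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong)
  • Received: 2010-03-28 Revised: 2010-07-15 Published: 2011-04-25 Released: 2011-04-25
  • About the authors: WANG Zhenyu (born 1967), male, from Xuchang, Henan; Ph.D., professor; research interests: distributed computing and SOA, operating systems and virtualization technology, embedded systems, and real-time processing. GUO Li (born 1987), male, from Xiantao, Hubei; master's degree; research interests: network storage technology and distributed computing systems.
  • Funding:

    Guangdong Provincial Science and Technology Plan Project (2007B01020049)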

An Analysis of the Search Engine User Behaviors Based on Hadoop

WANG Zhenyu1,GUO Li2   

  1. (1.School of Software Engineering,South China University of Technology,Guangzhou 510006;(2.School of Computer Science and Engineering,South China University of Technology,Guangzhou 510006,China)
  • Received:2010-03-28 Revised:2010-07-15 Online:2011-04-25 Published:2011-04-25

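Abstract (translated from Chinese):

Search engine user behavior analysis is an active research topic in web information retrieval. By analyzing users' click behavior and applying Web data mining techniques to extract useful information, it improves the efficiency of a search engine's retrieval algorithms and retrieval services and frees users from large volumes of unordered search results. To address the bottlenecks that traditional parallel computing models face in scalability and ease of programming, this paper presents a Hadoop-based model for processing massive log data. The model uses Hadoop's distributed file system, HDFS, together with the MapReduce parallel computing model to improve system scalability and ease of programming. The model was applied to about 22 million query log records collected by the Sogou search engine over one month. The results offer useful guidance for understanding user search behavior and for evaluating and improving search engine retrieval and ranking algorithms.

Keywords: Hadoop, distributed computing, user behavior analysis, massive data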

Abstract:

Search engine user behavior analysis is a focus of research in network information retrieval. It analyzes users' click behavior to mine useful information that improves the efficiency of a search engine's retrieval algorithms and services. To address the scalability and ease-of-programming bottlenecks of traditional parallel computation models, this paper presents a model for processing massive log data based on Hadoop, which improves scalability and ease of programming through the Hadoop Distributed File System (HDFS) and MapReduce. The model is then used to analyze about 22 million query log records collected by the Sogou search engine over one month. The results are instructive for understanding user behavior and for evaluating and improving search and ranking algorithms.

Key words: Hadoop; distributed computing; user behavior analysis; massive data
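
The MapReduce processing that the abstract describes can be illustrated with a minimal sketch. The Python below simulates the map, shuffle, and reduce phases in a single process to count how often each query appears in a log. It is not the authors' implementation and does not use the Hadoop API; the four-field tab-separated record layout, the sample records, and the names `map_query`, `shuffle`, `reduce_count`, and `run_job` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout (an assumption, not the actual Sogou format):
# user_id <TAB> query <TAB> result_rank <TAB> clicked_url
SAMPLE_LOG = [
    "u1\thadoop\t1\thttp://example.com/a",
    "u2\thadoop\t3\thttp://example.com/b",
    "u1\tmapreduce\t1\thttp://example.com/c",
    "malformed line",
    "u3\thadoop\t1\thttp://example.com/a",
]


def map_query(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (query, 1) for each well-formed record; skip the rest."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 4:
        yield fields[1], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group the emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the counts collected for one query."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence and return query frequencies."""
    mapped = (pair for line in lines for pair in map_query(line))
    return dict(reduce_count(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Print queries from most to least frequent.
    for query, count in sorted(run_job(SAMPLE_LOG).items(), key=lambda kv: -kv[1]):
        print(query, count)
```

On a real cluster the framework performs the shuffle step, and the map and reduce functions run in parallel over blocks of the input stored in HDFS; the sketch only shows the data flow between the phases.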