• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (04): 670-675.

• Articles •

A Hadoop-based framework for a WAN distributed topic crawler system

王淑芬1,高军礼1,邹普1,宋海涛2   

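  1. (1. School of Automation, Guangdong University of Technology, Guangzhou, Guangdong 510006; 2. School of Business Administration, South China University of Technology, Guangzhou, Guangdong 510641)
  • Received: 2013-08-12 Revised: 2014-04-10 Online: 2015-04-25 Published: 2015-04-25
  • Supported by:

    Major Program of the National Natural Science Foundation of China (710990403); Fund Project for the Central Universities (2014ZM0038); Key Guidance Project of the Guangdong Provincial-Ministerial Industry-University-Research Cooperation Program (2011B090400522)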

A framework of WAN distributed topic crawling system based on Hadoop

WANG Shufen1,GAO Junli1,ZOU Pu1,SONG Haitao2   

  1. (1. School of Automation, Guangdong University of Technology, Guangzhou 510006;
    2. School of Business Administration, South China University of Technology, Guangzhou 510641, China)
  • Received: 2013-08-12 Revised: 2014-04-10 Online: 2015-04-25 Published: 2015-04-25

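Abstract:

Compared with LAN crawlers, WAN distributed crawlers have many advantages, yet existing Hadoop-based distributed crawler designs mainly target LAN environments. To address the problem that the Hadoop distributed computing platform is not suited to deployment over a WAN, we design a Hadoop-based WAN distributed crawler system framework. The crawler system uses message-oriented middleware for reliable distributed communication, stores data in the scalable Hadoop Distributed File System (HDFS), parses web pages in parallel with MapReduce, and is made customizable through template matching. A performance simulation of the system shows that the framework can support large-scale concurrent crawler operation.

Key words: distributed crawler, Hadoop, crawler framework, template matching, topic crawler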

Abstract:

Compared with LAN crawling systems, WAN distributed crawling systems have many advantages; however, existing Hadoop-based crawling systems are mostly designed for LAN environments. Because the Hadoop distributed computing platform is poorly suited to WAN deployment, we present a Hadoop-based crawler framework for WANs. For scalable storage, all data are stored on the Hadoop Distributed File System (HDFS), and web pages are parsed in parallel with MapReduce. For reliable communication, message-oriented middleware is used. To make the framework customizable, a template matching method is proposed. A performance simulation shows that the crawler framework can support large-scale concurrent crawling.

Key words: WAN-based distributed crawler; Hadoop; crawling system framework; template matching; topic crawler
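
The abstract names template matching and MapReduce-style parallel parsing but does not give the template format or the job layout. The following is only a minimal in-process sketch of how the two ideas might fit together, not the authors' implementation. The `TEMPLATE` dictionary, the regex-per-field format, and the function names `map_page`, `reduce_by_host`, and `run` are all hypothetical, and plain Python stands in for a real Hadoop job.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical template: field name -> regex with one capture group.
# The paper's actual template format is not given in the abstract.
TEMPLATE = {
    "title": re.compile(r"<title>(.*?)</title>", re.S | re.I),
    "heading": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I),
}


def map_page(url, html, template=TEMPLATE):
    """Map step: emit one (host, record) pair for a fetched page."""
    record = {"url": url}
    for field, pattern in template.items():
        match = pattern.search(html)
        # Fields the template cannot find are stored as None.
        record[field] = match.group(1).strip() if match else None
    yield urlparse(url).netloc, record


def reduce_by_host(pairs):
    """Reduce step: group the extracted records by host."""
    grouped = defaultdict(list)
    for host, record in pairs:
        grouped[host].append(record)
    return dict(grouped)


def run(pages, template=TEMPLATE):
    """Run the map step, then the reduce step, over (url, html) pairs."""
    pairs = []
    for url, html in pages:
        pairs.extend(map_page(url, html, template))
    return reduce_by_host(pairs)


if __name__ == "__main__":
    pages = [
        ("http://example.com/a", "<title> A </title><h1>Alpha</h1>"),
        ("http://example.org/b", "<title>B</title>"),
    ]
    print(run(pages))
```

Keeping the extraction rules in a data structure rather than in code is one plausible reading of "customizable through template matching": under that assumption, targeting a new site means supplying a new template, while the map and reduce steps stay unchanged.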