• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (04): 670-675.

• Articles •

A framework of WAN distributed topic crawling system based on Hadoop

WANG Shufen1,GAO Junli1,ZOU Pu1,SONG Haitao2   

  (1. School of Automation, Guangdong University of Technology, Guangzhou 510006;
    2. School of Business Administration, South China University of Technology, Guangzhou 510641, China)
  • Received: 2013-08-12 Revised: 2014-04-10 Online: 2015-04-25 Published: 2015-04-25

Abstract:

Compared with LAN crawling systems, WAN distributed crawling systems have many advantages; however, existing Hadoop-based crawling systems are mostly deployed on LANs. To achieve high computing speed with Hadoop over a WAN, we present a crawler framework based on Hadoop. For extensible storage, all data are stored on the Hadoop Distributed File System (HDFS), and web pages are analyzed in parallel through MapReduce. For reliable communication, message-oriented middleware is used. To make the framework customizable, a template matching method is proposed. Performance simulation shows that the crawler framework can support large-scale crawling.

Key words: WAN-based distributed crawler; Hadoop; crawling system framework; template matching; topic crawler