• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2016, Vol. 38 ›› Issue (02): 217-223.

• Articles •

Design of a visual Deep Web crawler platform based on Hadoop 

LIU Tong(1), ZHANG Yang(2), SUN Qi(2), YUAN Chong(2)

  1. Beijing Key Laboratory of Cloud Computing Key Technology and Application, Beijing Computing Center, Beijing 100094, China
  2. Department of ToT and Big Data Applications, Beijing Key Laboratory of Cloud Computing Key Technology and Application, Beijing Computing Center, Beijing 100094, China
  • Received: 2015-09-10 Revised: 2015-11-13 Online: 2016-02-25 Published: 2016-02-25

Abstract:

With the development of information technology, the information resources available on the Internet have become far richer. Thanks to the rapid development of big data technology, relevant knowledge can now be extracted from this complex mass of Internet information. The most essential component is big data crawler technology, which crawls Internet data and saves it in structured form. In this paper, we design and implement an efficient Deep Web information crawler based on Hadoop. The crawler uses WebKit as its core engine, which enables both visual configuration and deep data collection. To improve efficiency, we also optimize the data collection algorithm by adjusting the task distribution strategy in Hadoop. Experimental results demonstrate that the developed data collection platform obtains better results.

Key words: data crawler; Hadoop; visualization