
2016, Vol. 38, Issue (02): 217-223.


Design of a visual Deep Web crawler platform based on Hadoop

LIU Tong1, ZHANG Yang2, SUN Qi2, YUAN Chong2

  1. Beijing Key Laboratory of Cloud Computing Key Technology and Application, Beijing Computing Center, Beijing 100094, China;
  2. Department of IoT and Big Data Applications, Beijing Key Laboratory of Cloud Computing Key Technology and Application, Beijing Computing Center, Beijing 100094, China
  • Received: 2015-09-10   Revised: 2015-11-13   Online: 2016-02-25   Published: 2016-02-25
  • Funding:

    National Natural Science Foundation of China (71303023); Budding Program (萌芽计划) of the Beijing Academy of Science and Technology



Abstract:

With the development of information technology, Internet information resources have become increasingly rich, and advances in big data technology now allow us to extract useful knowledge from this large and complex body of Internet data. The most fundamental step is big data collection, which crawls Internet data quickly and stores it in a structured form. In this paper, we design and implement an efficient, easy-to-operate visual Deep Web crawler platform based on Hadoop. The platform uses WebKit as its core engine to support visual crawl configuration and deep data collection, and it improves efficiency by optimizing the collection algorithm and adjusting Hadoop's task distribution strategy. Experimental results show that the proposed data collection platform performs well.
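The abstract does not give implementation details, but the general idea of distributing a Deep Web crawl over Hadoop can be illustrated with a minimal MapReduce sketch. The class names (DeepWebCrawlJob, UrlFetchMapper), the seed-URL input file, and the plain HttpURLConnection fetch below are placeholders of our own, not the paper's code; the actual platform renders pages with a WebKit engine and applies visually configured extraction templates, which are not reproduced here. NLineInputFormat is used so that seed URLs are spread evenly across map tasks, the kind of task-distribution tuning the abstract refers to.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DeepWebCrawlJob {

    /** Each map task receives a slice of the seed-URL list and fetches its pages. */
    public static class UrlFetchMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String pageUrl = line.toString().trim();
            if (pageUrl.isEmpty()) {
                return;
            }
            // Placeholder fetch: the real platform would render the page with a
            // WebKit engine here so that JavaScript-generated (Deep Web) content
            // and the visually configured field selectors can be handled.
            HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(10000);
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String l;
                while ((l = in.readLine()) != null) {
                    html.append(l).append('\n');
                }
            }
            // Emit (URL, raw page) pairs; structured field extraction would follow.
            context.write(new Text(pageUrl), new Text(html.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "deep-web-crawl");
        job.setJarByClass(DeepWebCrawlJob.class);
        job.setMapperClass(UrlFetchMapper.class);
        job.setNumReduceTasks(0);                 // map-only crawl job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Spread the seed-URL file evenly over map tasks (50 URLs per task),
        // a simple form of the task-distribution adjustment mentioned above.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 50);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Under these assumptions the job would be launched as `hadoop jar crawler.jar DeepWebCrawlJob seed_urls.txt output/`, with one output record per fetched page; tuning the lines-per-split value trades off task granularity against scheduling overhead.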

Key words: data crawler; Hadoop; visualization