• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (02): 231-237.

• Papers •

Survey on topic-focused crawlers

YU Juan (于娟), LIU Qiang (刘强)

  1. (School of Economics and Management, Fuzhou University, Fuzhou 350108, Fujian, China)
  • Received: 2013-08-27 Revised: 2013-10-18 Online: 2015-02-25 Published: 2015-02-25
  • Supported by:

    National Natural Science Foundation of China (71201032); Fujian Provincial Social Science Planning Project (2012C021); Social Science Research Project of the Fujian Provincial Department of Education (JA11040S)

Abstract:

Network information resources are growing exponentially while user needs are becoming increasingly personalized, and topic-focused crawlers have emerged in response. A topic-focused crawler is a program that downloads web pages relevant to a specific topic. Using information gathered while it runs, it follows promising hyperlinks and fetches only pages that appear relevant to the topic. Applications built on topic-focused crawling, such as search engines and domain corpus construction, are now widely used. We first define topic-focused crawlers and describe their operating principles. We then review recent research in China and abroad and compare the advantages and disadvantages of the various crawling strategies and related algorithms. Finally, we point out future research directions for topic-focused crawling.

Key words: web crawler; topic-focused crawler; search engine
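
The operating principle summarized in the abstract (score each fetched page against the topic, then follow the most promising links first) can be illustrated with a minimal best-first sketch. This is not the authors' algorithm or any specific strategy compared in the paper. The names `relevance`, `focused_crawl`, and `TOY_WEB`, the keyword-overlap score, and the threshold are all assumptions made for illustration, and an in-memory dictionary stands in for real HTTP fetching.

```python
import heapq

def relevance(text, topic_terms):
    """Toy score (an assumption): fraction of topic terms found in the page text."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)

def focused_crawl(web, seeds, topic_terms, threshold=0.5, max_pages=10):
    """Best-first crawl that keeps only pages scoring at least `threshold`.

    `web` is a hypothetical stand-in for the network: url -> (text, links).
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    kept = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # unreachable page in this toy model
        text, links = web[url]
        fetched += 1
        score = relevance(text, topic_terms)
        if score >= threshold:
            kept.append(url)
        # Each outgoing link inherits the score of the page that cites it,
        # so links found on relevant pages are explored first.
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return kept

# Hypothetical five-page web used only to exercise the sketch.
TOY_WEB = {
    "a": ("web crawler survey", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("focused crawler search engine", ["e"]),
    "d": ("more recipes", []),
    "e": ("topic crawler strategies", []),
}

print(focused_crawl(TOY_WEB, ["a"], ["crawler", "search"]))  # → ['a', 'c', 'e']
```

In this toy run the off-topic page `b` is still fetched, because its relevance is only known after download, but the link it contains is pushed to the back of the frontier. A real crawler would replace the dictionary lookup with network requests and a stronger relevance model.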