Survey on topic-focused crawlers

J4, 2015, Vol. 37, Issue (02): 231-237.

YU Juan, LIU Qiang

(School of Economics and Management, Fuzhou University, Fuzhou 350108, China)

Received: 2013-08-27  Revised: 2013-10-18  Online: 2015-02-25  Published: 2015-02-25
Funding:
National Natural Science Foundation of China (71201032); Fujian Provincial Social Science Planning Project (2012C021); Social Science Research Project of the Fujian Provincial Department of Education (JA11040S)

Abstract:

Web information resources are growing exponentially and users' needs are becoming increasingly personalized; topic-focused crawlers have emerged in response. A topic-focused crawler is a program that downloads web pages relevant to a specific topic. Using information gathered while crawling, it follows promising hyperlinks and fetches only pages that appear to be relevant. Search engines built on topic-focused crawlers, and domain corpora constructed with them, are already widely used. We first give the definition and operating principles of topic-focused crawlers, then review recent research at home and abroad and compare the advantages and disadvantages of the various crawling strategies and related algorithms. Finally, we point out future research directions for topic-focused crawling.

Key words: web crawler; focused crawler; search engine
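
The working principle summarized in the abstract (score each page against the topic, then follow the most promising links first) can be illustrated with a minimal best-first sketch. This is not the method of any particular system covered by the survey. The in-memory `TOY_WEB`, the term-overlap `relevance` function, the `threshold` value, and the rule that a link inherits the score of the page it was found on are all assumptions chosen for illustration; a real crawler would download pages over HTTP and use a stronger relevance model.

```python
import heapq
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("web crawler survey", ["a", "b"]),
    "a": ("focused crawler search engine", ["c"]),
    "b": ("cooking recipes and travel", ["d"]),
    "c": ("topic crawler strategies", []),
    "d": ("more travel notes", []),
}


def relevance(text, topic_terms):
    """Fraction of the topic terms that occur in the page text."""
    words = set(re.findall(r"\w+", text.lower()))
    return len(words & topic_terms) / len(topic_terms)


def focused_crawl(web, seed, topic_terms, threshold=0.3, max_pages=10):
    """Best-first crawl: links found on more relevant pages are visited first."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]  # stands in for an HTTP download
        fetched += 1
        score = relevance(text, topic_terms)
        if score < threshold:
            continue  # off-topic page: discard it and do not follow its links
        relevant.append(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # Assumption: a link inherits the score of its parent page.
                heapq.heappush(frontier, (-score, link))
    return relevant
```

With the toy data above, `focused_crawl(TOY_WEB, "seed", {"crawler", "search", "engine"})` returns `['seed', 'a', 'c']`: page `b` is fetched but rejected as off-topic, so its outgoing link to `d` is never followed.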

YU Juan, LIU Qiang. Survey on topic-focused crawlers[J]. J4, 2015, 37(02): 231-237.
