• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (1): 160-165.

• Articles •

基于概率模型的主题爬虫的研究和实现

白玉昭,梁久祯   

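  1. (School of Internet of Things Engineering, Jiangnan University, Wuxi, Jiangsu 214122)
  • Received: 2011-10-28 Revised: 2011-12-30 Published: 2013-01-25 Online: 2013-01-25
  • About the author: BAI Yuzhao (born 1986), male, from Handan, Hebei; master's student; research interests: information retrieval and search engines.
  • Funding:

    Supported by the National Natural Science Foundation of China (61170121)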

Research and implementation for focused crawler based on probabilistic model

BAI Yuzhao,LIANG Jiuzhen   

  1. (School of Internet of Things Engineering,Jiangnan University,Wuxi 214122,China)
  • Received:2011-10-28 Revised:2011-12-30 Online:2013-01-25 Published:2013-01-25

Abstract:

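Building on a range of existing focused crawlers, this paper proposes a focused crawler based on a probabilistic model. It analyzes several kinds of feature information collected during crawling and uses the probabilistic model to compute a priority value for each URL, by which URLs are filtered and ranked. The crawler addresses the shortcoming that most crawlers rely on a single crawling strategy. Unlike earlier focused crawlers, it uses a history evaluation metric and a web page quality metric in addition to the topic relevance metric, which handles the “topic drift” and “tunneling” problems fairly well while maintaining the quality of the collected resources. Several sets of experiments confirm its advantage in topic-page recall and average topic relevance.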

Key words: focused crawler; probabilistic model; URL filtering; URL ordering; priority value

Abstract:

Building on a study of existing focused crawlers, this paper proposes a focused crawler based on a probabilistic model. The crawler analyzes several kinds of feature information obtained during crawling and uses the probabilistic model to compute a priority value for each URL, which is then used to filter and sort URLs. This addresses a shortcoming of most existing crawlers, which rely on a single strategy for fetching web pages. What distinguishes the proposed crawler is that it considers not only topic relevance but also history evaluation and web page quality, so that the “topic drift” and “tunneling” problems are handled well and the quality of the collected resources is maintained. Experimental results show that, compared with other focused crawlers, the proposed crawler gathers more topic-relevant web pages while retrieving fewer pages, and achieves a higher average topic relevance.

Key words: focused crawler; probabilistic model; URL filtering; URL ordering; priority value
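
The abstract describes combining three indicators (topic relevance, history evaluation, web page quality) into one priority value per URL, then filtering and sorting by it, but it does not state the formula. The Python sketch below is only an illustration under stated assumptions, not the authors' model: it assumes each indicator is already a probability-like score in [0, 1], treats the three as independent evidence with a uniform prior (a naive-Bayes-style combination), and uses hypothetical names (`Candidate`, `priority`, `build_frontier`, `pop_next`) and an arbitrary `threshold`.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A URL waiting to be crawled, with three assumed scores in [0, 1]."""
    url: str
    topic: float    # topic relevance of the link's context (assumed given)
    history: float  # how relevant earlier pages from this source were
    quality: float  # page-quality estimate of the linking page


def priority(c: Candidate) -> float:
    """Combine the scores as independent evidence with a uniform prior.

    This combination rule is an assumption made for illustration only.
    """
    pos = c.topic * c.history * c.quality
    neg = (1 - c.topic) * (1 - c.history) * (1 - c.quality)
    if pos + neg == 0:
        return 0.5  # fully contradictory evidence: stay neutral
    return pos / (pos + neg)


def build_frontier(candidates, threshold=0.3):
    """Filter out low-priority URLs and order the rest in a max-heap."""
    scored = ((priority(c), c.url) for c in candidates)
    # heapq is a min-heap, so store negated priorities.
    heap = [(-p, url) for p, url in scored if p >= threshold]
    heapq.heapify(heap)
    return heap


def pop_next(heap):
    """Return the highest-priority (url, priority) pair."""
    neg_p, url = heapq.heappop(heap)
    return url, -neg_p
```

Under these assumptions, a link with weak topic evidence but a strong history score (for example `Candidate("b", 0.2, 0.9, 0.6)`) still clears the threshold, which is one way the history indicator could let a crawler pass through an off-topic page ("tunneling"), while a link that is weak on every indicator is dropped.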