• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2013, Vol. 35 ›› Issue (1): 160-165.

• Articles •

Research and implementation for focused crawler based on probabilistic model

BAI Yuzhao,LIANG Jiuzhen   

  1. (School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China)
  • Received: 2011-10-28 Revised: 2011-12-30 Online: 2013-01-25 Published: 2013-01-25

Abstract:

Building on a survey of existing focused crawlers, this paper proposes a focused crawler based on a probabilistic model. The crawler analyzes the features gathered during the crawl and uses the probabilistic model to compute a priority for each URL, which is then used to filter and rank URLs. This addresses a shortcoming of most existing crawlers, which typically rely on a single strategy for fetching pages from the Internet. The distinguishing feature of the proposed crawler is that it considers not only topic relevance but also historical evaluation and page quality, so that the “topic drift” and “tunneling” problems are solved and the quality of the collected resources is maintained. Experimental results show that, compared with other focused crawlers, the proposed crawler gathers more topic-relevant pages while retrieving fewer pages overall, and achieves a higher average topic relevance.

Key words: focused crawler; probabilistic model; URL filtering; URL ordering; priority value
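
The abstract describes computing a per-URL priority from several features and then using it to filter and order the crawl frontier, but it does not give the model itself. The following is only a minimal sketch of that general idea, not the authors' method: it assumes each of the three features named in the abstract (topic relevance, historical evaluation, page quality) has already been estimated as a probability in [0, 1], and it combines them with a plain product, as if they were independent evidence. The names `url_priority` and `Frontier`, the `threshold` parameter, and the example URLs are all hypothetical.

```python
import heapq


def url_priority(topic_relevance, history_score, page_quality):
    """Combine three per-URL estimates, each in [0, 1], into one priority.

    Assumption (not from the paper): the estimates are treated as
    independent probabilities and simply multiplied.
    """
    for p in (topic_relevance, history_score, page_quality):
        if not 0.0 <= p <= 1.0:
            raise ValueError("each feature must be a probability in [0, 1]")
    return topic_relevance * history_score * page_quality


class Frontier:
    """Crawl frontier that filters URLs by priority and pops the best first."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold  # hypothetical filtering cutoff
        self._heap = []             # min-heap of (-priority, url)
        self._seen = set()          # URLs already offered to the frontier

    def push(self, url, topic_relevance, history_score, page_quality):
        """Offer a URL; return True if it was queued, False if rejected."""
        if url in self._seen:
            return False
        self._seen.add(url)
        priority = url_priority(topic_relevance, history_score, page_quality)
        if priority < self.threshold:
            return False  # URL filtering: drop low-priority candidates
        # URL ordering: negate so the highest priority is popped first.
        heapq.heappush(self._heap, (-priority, url))
        return True

    def pop(self):
        """Remove and return (url, priority) for the best queued URL."""
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier(threshold=0.1)
    frontier.push("http://example.com/a", 0.9, 0.8, 0.7)
    frontier.push("http://example.com/b", 0.5, 0.5, 0.5)
    frontier.push("http://example.com/c", 0.2, 0.5, 0.5)  # filtered out
    while frontier:
        print(frontier.pop())
```

A plain product penalizes any URL with one weak feature, so a real crawler that needs to "tunnel" through low-relevance pages would likely weight the historical evaluation differently; the product is used here only to keep the sketch short.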