• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2010, Vol. 32 ›› Issue (5): 126-129.doi: 10.3969/j.issn.1007130X.2010.

• Articles •

Design of a Focused Crawler Based on the Improved Genetic Algorithm

FAN Huilian1,LI Xianli2,ZENG Guangpu1   

  • (1. School of Mathematics and Computer Science, Yangtze Normal University, Chongqing 408100; 2. Network Information Center, Yangtze Normal University, Chongqing 408100, China)
  • Received: 2008-03-13 Revised: 2009-12-10 Online: 2010-04-28 Published: 2010-05-11
  • Contact: FAN Huilian E-mail: fhlmx@163.com

Abstract: This paper presents a design method for a focused crawler built around a crawling controller and a page analysis filter. Starting from the topic to be retrieved, the method uses an improved genetic algorithm that combines the advantages of content evaluation and link-structure analysis. The crawler treats each URL as a genetic individual, applies a topic-word-based vector space model (VSM) to assess individual fitness, imports new URLs to carry out the crossover and mutation operations, and regards URLs that share the same prefix as a niche. The experimental results show that the approach achieves better performance.
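
The abstract does not give the fitness formula or the exact niche rule, so the following is only a minimal sketch of how the two ideas could look. It assumes raw term-frequency weights, whitespace tokenization, cosine similarity as the VSM score, and "same prefix" meaning the same scheme and host. The function names `cosine_fitness`, `niche_key`, and `group_into_niches` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def cosine_fitness(topic_words, page_text):
    """Score a page against the topic words with cosine similarity.

    Assumption: both vectors use raw term frequencies; the paper may
    use a different weighting scheme.
    """
    topic = Counter(word.lower() for word in topic_words)
    page = Counter(page_text.lower().split())
    # Only topic words can contribute to the dot product.
    dot = sum(weight * page[word] for word, weight in topic.items())
    norm = math.sqrt(sum(v * v for v in topic.values())) * math.sqrt(
        sum(v * v for v in page.values())
    )
    return dot / norm if norm else 0.0


def niche_key(url):
    """Assumed niche definition: the scheme and host shared by the URLs."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def group_into_niches(urls):
    """Group candidate URLs (the genetic individuals) by their prefix."""
    niches = defaultdict(list)
    for url in urls:
        niches[niche_key(url)].append(url)
    return dict(niches)
```

In a crawler of this kind, the fitness of a URL would typically be computed from the text of the fetched page (or of its anchor context), and selection would then operate within each niche so that a single site does not dominate the population. That usage is likewise an assumption rather than a detail stated in the abstract.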

Key words: focused crawler; crawling controller; topic relevancy; data extraction
