• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2010, Vol. 32 ›› Issue (5): 126-129.doi: 10.3969/j.issn.1007130X.2010.

• Paper •

基于改进遗传算法的聚焦爬虫设计

范会联1,李献礼2,曾广朴1   

  1. (1.长江师范学院数学与计算机学院,重庆 408100;2.长江师范学院网络信息中心,重庆 408100)
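  • Received: 2008-03-13  Revised: 2009-12-10  Online: 2010-04-28  Published: 2010-05-11
  • Corresponding author: FAN Huilian  E-mail: fhlmx@163.com
  • About the authors: FAN Huilian (born 1971), male, from Shizhu, Chongqing; master's degree, associate professor, CCF member (E200013523M); research interests: software engineering and intelligent information processing. LI Xianli, professor; research interests: nonlinear algorithms and data mining. ZENG Guangpu, lecturer; research interests: network information systems and data mining.
  • Funding:
    Science and Technology Research Project of the Chongqing Municipal Education Commission (KJ091309)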

Design of a Focused Crawler Based on the Improved Genetic Algorithm

FAN Huilian1,LI Xianli2,ZENG Guangpu1   

  (1. School of Mathematics and Computer Science, Yangtze Normal University, Chongqing 408100; 2. Network Information Center, Yangtze Normal University, Chongqing 408100, China)
  • Received: 2008-03-13  Revised: 2009-12-10  Online: 2010-04-28  Published: 2010-05-11
  • Contact: FAN Huilian  E-mail: fhlmx@163.com

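Abstract: This paper proposes a design method for a focused crawler built around a crawling controller and a page analysis filter. Starting from the topic to be retrieved, the crawler is guided by a crawling strategy that is based on an improved genetic algorithm and combines the strengths of content-evaluation and link-structure search strategies. URLs waiting to be crawled serve as the genetic individuals; individual fitness is evaluated with a vector space model built on the topic word set; crossover and mutation are carried out by introducing new URLs; and links that share the same URL prefix are handled as a niche. Practice shows that the crawler performs well.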

Keywords: focused crawler, crawling controller, topic relevance, data extraction

Abstract: This paper presents a design method for a focused crawler centered on a crawling controller and a page analysis filter. Starting from the topic to be retrieved, the method builds on an improved genetic algorithm and combines the advantages of content evaluation and link-structure analysis. The crawler treats each URL as a genetic individual, assesses individual fitness with a topic-word-based vector space model (VSM), introduces new URLs to perform crossover and mutation, and treats URLs that share the same prefix as a niche. Experimental results show that the approach performs well.

Key words: focused crawler; crawling controller; topic relevance; data extraction
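
To make the ideas in the abstract more concrete, the following is a minimal Python sketch of two of its ingredients: scoring a candidate URL with a vector space model over a topic word set, and grouping URLs that share a prefix into niches. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the fitness formula, the definition of a URL prefix, or the niche rule, so the cosine similarity, the `scheme://host` prefix, the keep-the-best-`per_niche` rule, and all function and parameter names here are hypothetical choices.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def vsm_fitness(tokens, topic_weights):
    """Cosine similarity between a text and the topic word set.

    Assumption: the text (for example anchor text or page content tied
    to a URL) is projected onto the topic words only, with raw term
    frequency as the document-side weight.
    """
    tf = Counter(t for t in tokens if t in topic_weights)
    dot = sum(count * topic_weights[word] for word, count in tf.items())
    norm_doc = math.sqrt(sum(c * c for c in tf.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_doc == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_doc * norm_topic)


def url_prefix(url):
    """Assumed niche key: scheme plus host of the URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def select_by_niche(scored_urls, per_niche=2):
    """Keep the fittest `per_niche` URLs from each prefix group.

    Assumption: limiting survivors per niche stands in for the niche
    handling mentioned in the abstract, so that one site cannot
    dominate the URL population.
    """
    niches = defaultdict(list)
    for url, fitness in scored_urls:
        niches[url_prefix(url)].append((url, fitness))
    survivors = []
    for members in niches.values():
        members.sort(key=lambda pair: pair[1], reverse=True)
        survivors.extend(members[:per_niche])
    # Highest-fitness URLs first, ready to be crawled in that order.
    return sorted(survivors, key=lambda pair: pair[1], reverse=True)
```

In a full crawler along these lines, newly discovered links from fetched pages would be scored with `vsm_fitness` and merged into the population, which is one way to read the abstract's description of crossover and mutation as the introduction of new URLs.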
