• Official journal of the China Computer Federation (CCF)
  • Core Journal of Chinese Science and Technology
  • Chinese Core Journal

J4 ›› 2010, Vol. 32 ›› Issue (9): 145-147.

• Articles •

A Topic Crawler Algorithm Based on Semantic Analysis

JIANG Zongli, TIAN Xiaoyan, ZHAO Xu

  1. (School of Computer Science, Beijing University of Technology, Beijing 100124, China)
  • Received: 2010-03-12 Revised: 2009-06-17 Online: 2010-09-02 Published: 2010-09-02
  • About the authors: JIANG Zongli (born 1956), male, from Nanyang, Henan, professor, CCF member (E200005392s), research interests: network information processing and parallel computing; TIAN Xiaoyan, master's student, research interests: network information processing and machine learning; ZHAO Xu, master's student, research interests: network information processing and machine learning.

Abstract:

The sheer number of web pages and their rapid growth make it difficult for general-purpose search engines to return satisfactory results for topic-oriented or domain-oriented queries. The topic crawler studied in this paper collects only topic-relevant information, which greatly reduces the number of web pages that must be processed. It evaluates the topical relevance of each web page and crawls pages with higher relevance first. Using a subspace-based semantic analysis technique combined with a Bayesian method and a support vector machine, we design and implement an efficient topic crawler. Experiments show that the algorithm achieves good accuracy and efficiency.

Key words: topic crawler; subspace; semantic analysis; support vector machine
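
The prioritization described in the abstract, where pages with higher topical relevance are crawled first, can be illustrated with a best-first frontier. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function `best_first_crawl`, its parameters, and the toy link graph and scores are all hypothetical, and the `relevance` callback is only a placeholder for whatever score the subspace-based semantic analysis and the classifiers would produce.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple


def best_first_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    relevance: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages URLs, always expanding the highest-scoring one next."""
    frontier: List[Tuple[float, str]] = []
    seen: Set[str] = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            # heapq is a min-heap, so scores are negated to pop the best first.
            heapq.heappush(frontier, (-relevance(url), url))

    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return visited


# Hypothetical toy link graph and relevance scores (illustration only).
TOY_WEB = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
TOY_SCORES = {"seed": 1.0, "a": 0.2, "b": 0.9, "c": 0.1, "d": 0.8}

order = best_first_crawl(
    ["seed"],
    fetch_links=lambda u: TOY_WEB.get(u, []),
    relevance=lambda u: TOY_SCORES.get(u, 0.0),
    max_pages=4,
)
print(order)
```

With a page budget of four, the low-scoring page `c` is never fetched, which is the sense in which a topic crawler reduces the number of pages processed. A real crawler would score a link from page content or anchor text rather than from a lookup table.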