• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (11): 1-6.

• Articles •

面向分层结构的网页分类与抓取

王振宇¹,唐远华¹,郭力²

  (1. 华南理工大学软件学院,广东 广州 510006;
    2. 华南理工大学计算机科学与工程学院,广东 广州 510006)
  • Received: 2011-08-31 Revised: 2011-12-12 Online: 2012-11-25 Published: 2012-11-25
  • Funding:

    Guangdong Provincial Science and Technology Plan Project (2010B010600017)

Categorization and Extraction of Web Pages Based on Hierarchy

WANG Zhenyu1,TANG Yuanhua1,GUO Li2   

  1. (1.School of Software Engineering,South China University of Technology,Guangzhou 510006;
    2.School of Computer Science and Engineering,
    South China University of Technology,Guangzhou 510006,China)
  • Received:2011-08-31 Revised:2011-12-12 Online:2012-11-25 Published:2012-11-25

Abstract:

Traditional web crawlers serve general-purpose search engines built on keyword retrieval. They do not capture the category of the pages they fetch, which hurts both the computational efficiency and the accuracy of downstream text clustering and topic detection. To address this, the paper proposes a method for categorizing and extracting web pages based on the hierarchical structure of web sites. By building a virtual site-hierarchy category tree and extracting the hierarchical structure of real sites, the authors design and implement a hierarchy-oriented crawler that assigns each page a category as it is crawled. For sites that carry no category information, a title-based page categorization technique is presented, comprising construction of a domain knowledge base and computation of word semantic similarity based on HowNet (《知网》). Experimental results show that the method achieves good categorization performance.

Key words: web crawler; page categorization; domain knowledge base; HowNet
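The two-stage idea in the abstract (take the category from the site hierarchy when it is available, otherwise fall back to the page title) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the category tree, the knowledge base, and the sememe table are all hypothetical toy data; reading the hierarchy from URL path segments is an assumption about how site structure might be recovered; the Dice overlap of sememe sets is a simple stand-in for a HowNet-based similarity measure; and whitespace tokenization is used only for the demo (Chinese titles would need word segmentation first).

```python
from urllib.parse import urlparse

# Hypothetical virtual category tree: URL path segment -> category label.
VIRTUAL_TREE = {"sports": "Sports", "finance": "Finance", "tech": "Technology"}

# Hypothetical domain knowledge base: category -> representative terms.
KNOWLEDGE_BASE = {
    "Sports": ["football", "match", "league"],
    "Finance": ["stock", "bank", "market"],
    "Technology": ["software", "internet", "chip"],
}

# Hypothetical sememe sets standing in for HowNet entries.
SEMEMES = {
    "football": {"sport", "ball", "team"},
    "soccer": {"sport", "ball", "team"},
    "match": {"sport", "compete"},
    "league": {"sport", "team", "organization"},
    "stock": {"money", "trade"},
    "shares": {"money", "trade"},
    "bank": {"money", "institution"},
    "market": {"trade", "place"},
    "software": {"computer", "program"},
    "internet": {"computer", "network"},
    "chip": {"computer", "part"},
}


def word_similarity(a, b):
    """Dice overlap of sememe sets (a stand-in, not the HowNet formula)."""
    if a == b:
        return 1.0
    sa, sb = SEMEMES.get(a, set()), SEMEMES.get(b, set())
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def category_from_url(url):
    """Return the category of the first path segment found in the tree."""
    segments = [s for s in urlparse(url).path.lower().split("/") if s]
    for seg in segments:
        if seg in VIRTUAL_TREE:
            return VIRTUAL_TREE[seg]
    return None


def category_from_title(title, threshold=0.6):
    """Pick the category whose terms best match any title word."""
    words = title.lower().split()
    best_cat, best_score = None, 0.0
    for cat, terms in KNOWLEDGE_BASE.items():
        score = max(
            (word_similarity(w, t) for w in words for t in terms),
            default=0.0,
        )
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat if best_score >= threshold else None


def classify_page(url, title):
    """Prefer the site hierarchy; fall back to the title when it is silent."""
    return category_from_url(url) or category_from_title(title)
```

Calling `classify_page` with a URL such as `http://example.com/sports/2012/a.html` resolves the category from the path alone. A URL with no recognizable segment falls through to the title comparison, and the function returns `None` when no category clears the threshold.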