• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2012, Vol. 34 ›› Issue (11): 1-6.

• Articles •

Categorization and Extraction of Web Pages Based on Hierarchy

WANG Zhenyu1,TANG Yuanhua1,GUO Li2   

  (1. School of Software Engineering, South China University of Technology, Guangzhou 510006;
    2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China)
  • Received: 2011-08-31  Revised: 2011-12-12  Online: 2012-11-25  Published: 2012-11-25

Abstract:

Traditional web crawlers provide services based on keyword search. They cannot extract the categorization information of web pages, which causes efficiency and accuracy problems in text clustering and topic detection. To solve this problem, this paper proposes a hierarchy-based method for categorizing and extracting web pages. By building a virtual hierarchical categorization tree and extracting the hierarchies of real web sites, a web page is categorized at the moment it is crawled. For sites that carry no categorization information, a categorization algorithm based on page titles is presented, which builds a domain knowledge base and calculates semantic similarity based on Hownet. The experimental results show that the method performs well.

Key words: web crawler; page categorization; domain knowledge base; Hownet
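
The title-based categorization step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `CategoryNode` class, the example tree, its term sets, and the greedy top-down descent are all hypothetical, and `word_similarity` is a simple character-bigram overlap standing in for the Hownet-based semantic similarity, which the abstract does not specify in detail.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryNode:
    """One node of a hypothetical virtual categorization tree."""
    name: str
    terms: set[str] = field(default_factory=set)  # assumed knowledge-base terms
    children: list["CategoryNode"] = field(default_factory=list)


def word_similarity(a: str, b: str) -> float:
    """Stand-in for Hownet similarity: Jaccard overlap of character bigrams."""
    def bigrams(w: str) -> set[str]:
        w = w.lower()
        return {w[i:i + 2] for i in range(len(w) - 1)} or {w}

    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def title_score(words: list[str], terms: set[str]) -> float:
    """Average, over title words, of the best similarity to any category term."""
    if not words or not terms:
        return 0.0
    return sum(max(word_similarity(w, t) for t in terms) for w in words) / len(words)


def categorize(title: str, root: CategoryNode) -> list[str]:
    """Descend the tree greedily, taking the best-scoring child at each level.

    Stops early when no child has a non-zero score for the title.
    """
    words = title.lower().split()
    path: list[str] = []
    node = root
    while node.children:
        best = max(node.children, key=lambda c: title_score(words, c.terms))
        if title_score(words, best.terms) == 0.0:
            break
        path.append(best.name)
        node = best
    return path


# Hypothetical two-level tree used only for illustration.
root = CategoryNode("root", children=[
    CategoryNode("sports", {"football", "basketball", "match"}, [
        CategoryNode("football", {"football", "goal", "league"}),
        CategoryNode("basketball", {"basketball", "nba", "dunk"}),
    ]),
    CategoryNode("finance", {"stock", "market", "bank"}),
])

print(categorize("football league match results", root))
```

In this sketch a title is assigned the path of categories whose term sets it resembles most closely at each level; replacing `word_similarity` with a real semantic similarity measure would leave the rest of the structure unchanged.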