Categorization and Extraction of Web Pages Based on Hierarchy

J4 ›› 2012, Vol. 34 ›› Issue (11): 1-6.

WANG Zhenyu1,TANG Yuanhua1,GUO Li2

(1. School of Software Engineering, South China University of Technology, Guangzhou 510006;
2. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China)

Received: 2011-08-31; Revised: 2011-12-12; Online: 2012-11-25; Published: 2012-11-25

Abstract

Traditional web crawlers provide services based on keyword search. They cannot extract the category information of web pages, which reduces the efficiency and accuracy of downstream text clustering and topic detection. To solve this problem, this paper proposes a hierarchy-based method for categorizing and extracting web pages. By building a virtual hierarchical categorization tree and extracting the hierarchies of real web sites, the method categorizes each web page as it is crawled. For sites that carry no categorization information, a categorization algorithm based on page titles is presented, which builds a domain knowledge base and calculates semantic similarity using HowNet. Experimental results show that the method performs well.

Key words: web crawler; page categorization; domain knowledge base; HowNet
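
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the site hierarchy, the knowledge base entries, the threshold, and all function names are hypothetical, and a character-bigram Dice coefficient stands in for the HowNet semantic similarity, which the abstract does not specify in detail.

```python
from urllib.parse import urlparse

# Hypothetical mapping from real site sections to virtual tree categories.
SITE_HIERARCHY = {
    "news.example.com": {"/sports": "Sports", "/finance": "Finance"},
}

# Hypothetical domain knowledge base: category -> representative terms.
KNOWLEDGE_BASE = {
    "Sports": ["football", "match", "league"],
    "Finance": ["stock", "market", "bank"],
}


def word_similarity(a, b):
    """Stand-in for HowNet similarity: Dice coefficient over character bigrams."""
    if a == b:
        return 1.0
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a or not bigrams_b:
        return 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def categorize_by_hierarchy(url):
    """Return the category of the longest matching site section, or None."""
    parsed = urlparse(url)
    best_prefix, best_category = "", None
    for prefix, category in SITE_HIERARCHY.get(parsed.netloc, {}).items():
        in_section = parsed.path == prefix or parsed.path.startswith(prefix + "/")
        if in_section and len(prefix) > len(best_prefix):
            best_prefix, best_category = prefix, category
    return best_category


def categorize_by_title(title, threshold=0.5):
    """Pick the category whose terms best match any title word, or None."""
    words = title.lower().split()
    best_category, best_score = None, 0.0
    for category, terms in KNOWLEDGE_BASE.items():
        score = max(
            (word_similarity(w, t) for w in words for t in terms), default=0.0
        )
        if score > best_score:
            best_category, best_score = category, score
    return best_category if best_score >= threshold else None


def categorize(url, title):
    """Use the site hierarchy when available, else fall back to the title."""
    return categorize_by_hierarchy(url) or categorize_by_title(title)
```

For example, `categorize("http://news.example.com/sports/a.html", "Weekly digest")` resolves through the site hierarchy, while a page on an unknown site is categorized from its title alone. Replacing `word_similarity` with a real HowNet-based measure would leave the rest of the sketch unchanged.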

WANG Zhenyu1,TANG Yuanhua1,GUO Li2 . Categorization and Extraction of Web Pages Based on Hierarchy[J]. J4, 2012, 34(11): 1-6.
