• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2011, Vol. 33 ›› Issue (1): 157-160. doi: 10.3969/j.issn.1007130X.2011.

• Articles •

The Web Forum Crawling Technology and System Implementation

PENG Dong, CAI Wandong

  1. (School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China)
  • Received:2010-01-04 Revised:2010-05-13 Online:2011-01-25 Published:2011-01-25

Abstract:

Web spiders are essential for gathering information, but they face new challenges when used to crawl Web forums. This paper studies the basic techniques for crawling Web forums, and designs and implements a system for gathering forum information. A traversal strategy is proposed based on the information structure of forums, and a DOM and block algorithm is proposed based on the distribution of the context within a page. Experimental results show that the traversal strategy retrieves highly subject-relevant Web pages more efficiently than traditional traversal methods, and that applying the proposed approach to context extraction from Web pages effectively improves the accuracy of information collection.

Key words: web spider; web forum; context extracting; subject relevant