The Web Forum Crawling Technology and System Implementation
Received date: 2010-01-04
Revised date: 2010-05-13
Published online: 2011-01-25
Funding
National High Technology Research and Development Program of China (863 Program) (2009AA01Z424); Northwestern Polytechnical University Key Support Project for Undergraduate Graduation Design, Class of 2009
Peng Dong, Cai Wandong. The Web Forum Crawling Technology and System Implementation[J]. Computer Engineering & Science, 2011, 33(1): 157-160. DOI: 10.3969/j.issn.1007130X.2011.
Web spiders play an important role in gathering information, but they face new challenges when used to crawl Web forums. This paper studies the basic techniques for crawling Web forums, and designs and implements a system for gathering Web forum information. A traversal strategy is proposed based on the information structure of forums, and a DOM and block algorithm is proposed based on the distribution of page content. The experimental results show that the traversal strategy retrieves highly subject-relevant Web pages more efficiently than traditional traversal methods, and that applying the proposed approach to the content extraction of Web pages effectively improves the accuracy of information collection.