The Web Forum Crawling Technology and System Implementation

PENG Dong, CAI Wandong

doi:10.3969/j.issn.1007130X.2011.

Computer Engineering & Science, 2011, Vol. 33, Issue 1: 157-160


Paper


(School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, Shaanxi, China)

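PENG Dong (born 1984), male, from Jiangyou, Sichuan; master's degree; research interests: network and information security. CAI Wandong (born 1955), male, from Wendeng, Shandong; Ph.D., professor; research interests: network and information security.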

Received date: 2010-01-04

Revised date: 2010-05-13

Online published: 2011-01-25

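Funding

National 863 Program of China (2009AA01Z424); Key Support Project for Undergraduate Graduation Design, Class of 2009, Northwestern Polytechnical University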





Abstract

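Web crawling is an important means of acquiring information from the network, and acquiring information from Web forums is a new problem for crawler technology. Building on an analysis of information acquisition techniques for Web forums, this paper designs and implements a topic-focused web crawler system for collecting Web forum information. Based on how Web forums organize their information, it proposes an information search technique built on a traversal strategy; based on the distribution of main-text content and the characteristics of forums themselves, it proposes a main-text extraction technique that combines DOM with a page-blocking algorithm. Experimental results show that the traversal strategy is more efficient than traditional crawler traversal strategies and collects more pages with high topic relevance; after noise cleaning, the main text of pages is extracted effectively, which improves the accuracy of information collection.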

Keywords: web crawler; Web forum; main-text extraction; topic relevance

Cite this article

PENG Dong, CAI Wandong. The Web Forum Crawling Technology and System Implementation[J]. Computer Engineering & Science, 2011, 33(1): 157-160. DOI: 10.3969/j.issn.1007130X.2011.

Abstract

The Web spider is an important tool for gathering information, but it faces new challenges when it is used to crawl Web forums. This paper studies the basic technologies for crawling Web forums, and designs and implements a system for gathering Web forum information. According to the information structure of forums, a traversal strategy is proposed. Based on the distribution of the main text, an extraction algorithm that combines DOM and page blocks is proposed. The experimental results show that the traversal strategy is more efficient than traditional traversal strategies at retrieving highly subject-relevant Web pages, and that applying the extraction algorithm to Web pages effectively improves the accuracy of information collection.

Key words: web spider; web forum; main-text extraction; subject relevance
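
The abstract names a main-text extraction technique that combines DOM with page blocks, but it does not give the algorithm's details. As a rough illustration of the general idea only, the following Python sketch walks the tag structure of a page, groups text into blocks, and drops blocks that are short or dominated by link text (typical of navigation bars and advertisements). The choice of block tags, the `min_chars` and `max_link_density` thresholds, and the link-density heuristic itself are assumptions made for this example and are not taken from the paper; the sketch also assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as page blocks (not specified in the paper).
BLOCK_TAGS = {"div", "td", "p", "li", "blockquote"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects the text of each block together with its link-text length."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []    # open blocks, innermost last: [text_parts, link_chars]
        self.blocks = []   # closed, non-empty blocks as (text, link_chars)
        self.in_link = 0   # nesting depth of <a> elements
        self.skip = 0      # nesting depth of <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            self.stack.append([[], 0])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip:
            self.skip -= 1
        elif tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in BLOCK_TAGS and self.stack:
            parts, link_chars = self.stack.pop()
            text = " ".join(parts)
            if text:
                self.blocks.append((text, link_chars))

    def handle_data(self, data):
        data = data.strip()
        if not data or self.skip or not self.stack:
            return
        block = self.stack[-1]          # text belongs to the innermost open block
        block[0].append(data)
        if self.in_link:
            block[1] += len(data)


def extract_main_text(html, min_chars=20, max_link_density=0.3):
    """Return the text of blocks that look like content rather than noise."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, link_chars in collector.blocks:
        link_density = link_chars / len(text)
        if len(text) >= min_chars and link_density <= max_link_density:
            kept.append(text)
    return kept


SAMPLE_PAGE = """
<html><body>
<div class="nav"><a href="/">Home</a> <a href="/board">Board index</a> <a href="/login">Log in</a></div>
<div class="post"><p>The crawler should visit board pages before thread pages.</p></div>
<div class="ad"><a href="/buy">Buy cheap widgets here today</a></div>
<script>var tracking = "this text should never be extracted";</script>
</body></html>
"""

if __name__ == "__main__":
    for block in extract_main_text(SAMPLE_PAGE):
        print(block)
```

On the hypothetical `SAMPLE_PAGE`, the navigation bar and the advertisement consist almost entirely of link text and are filtered out, so only the post body is kept. A real forum crawler would likely tune the thresholds per site and use a tolerant HTML parser, since forum markup is often malformed.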

