面向分层结构的网页分类与抓取

J4 ›› 2012, Vol. 34 ›› Issue (11): 1-6.

Categorization and Extraction of Web Pages Based on Hierarchy

WANG Zhenyu¹ (王振宇), TANG Yuanhua¹ (唐远华), GUO Li² (郭力)

(1. School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China;
2. School of Computer Science and Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China)

Received: 2011-08-31  Revised: 2011-12-12  Online: 2012-11-25  Published: 2012-11-25
Supported by:
Science and Technology Planning Project of Guangdong Province (2010B010600017)

Abstract:

Traditional web crawlers serve general-purpose search engines built on keyword retrieval. They cannot capture the category information of web pages, which causes efficiency and accuracy problems for text clustering and topic detection. To address this, this paper proposes a method for categorizing and extracting web pages based on the hierarchical structure of web sites. By building a virtual site hierarchy categorization tree and extracting the hierarchical structure of real sites, a hierarchy-oriented crawler is designed and implemented that assigns each page a category as it is crawled. For sites that carry no categorization information, a title-based page categorization technique is presented, which comprises building a domain knowledge base and computing word semantic similarity based on HowNet. Experimental results show that the method achieves good categorization performance.

Key words: web crawler; page categorization; domain knowledge base; HowNet
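
The two-stage idea in the abstract (use the site hierarchy when it is available, fall back to the page title otherwise) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `CategoryNode`, `category_from_url`, `category_from_title` and `classify` are hypothetical, the site hierarchy is approximated by URL path segments, and `toy_similarity` is an exact-match placeholder standing in for a real HowNet-based word similarity.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CategoryNode:
    """One node of a hypothetical virtual site hierarchy tree."""
    name: str
    children: dict = field(default_factory=dict)

    def add_path(self, segments):
        """Insert a chain of nested categories and return the deepest node."""
        node = self
        for seg in segments:
            node = node.children.setdefault(seg, CategoryNode(seg))
        return node


def category_from_url(root, url):
    """Follow the URL path through the tree; return the matched category names."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    node, matched = root, []
    for seg in segments:
        if seg not in node.children:
            break
        node = node.children[seg]
        matched.append(node.name)
    return matched


def toy_similarity(a, b):
    """Placeholder (assumption) for a HowNet-based similarity: exact match only."""
    return 1.0 if a == b else 0.0


def category_from_title(title_words, knowledge_base,
                        similarity=toy_similarity, threshold=0.5):
    """Pick the category whose terms best match the title words on average."""
    best_cat, best_score = None, 0.0
    for cat, terms in knowledge_base.items():
        if not title_words or not terms:
            continue
        score = sum(max(similarity(w, t) for t in terms)
                    for w in title_words) / len(title_words)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat if best_score >= threshold else None


def classify(root, knowledge_base, url, title_words):
    """Prefer the site hierarchy; fall back to the title-based match."""
    path = category_from_url(root, url)
    if path:
        return "/".join(path)
    return category_from_title(title_words, knowledge_base)
```

In this sketch a page whose URL falls under a known branch of the tree inherits that branch as its category, and only pages outside the tree are scored against the domain knowledge base. Replacing `toy_similarity` with a real semantic similarity function would leave the rest of the code unchanged.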

WANG Zhenyu, TANG Yuanhua, GUO Li. Categorization and Extraction of Web Pages Based on Hierarchy[J]. J4, 2012, 34(11): 1-6.

