• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2014, Vol. 36 ›› Issue (03): 404-410.

• Articles •

An associated storage and retrieval system of massive Web data based on multi-attributes

LUO Fang1, LI Chunhua1, ZHOU Ke1, HUANG Yongfeng2, LIAO Zhengshuang1

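  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074
  2. Department of Electronic Engineering, Tsinghua University, Beijing 100084
  • Received: 2013-06-08 Revised: 2013-10-20 Online: 2014-03-25 Published: 2014-03-25
  • Supported by:

    National 863 Program of China (2012AA011004); Tsinghua University Initiative Scientific Research Program (20111081023)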

An associated storage and retrieval system of massive Web data based on multi-attributes

LUO Fang1, LI Chunhua1, ZHOU Ke1, HUANG Yongfeng2, LIAO Zhengshuang1

  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
  2. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
  • Received: 2013-06-08 Revised: 2013-10-20 Online: 2014-03-25 Published: 2014-03-25

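Abstract:

Traditional Web data retrieval generally relies on full-text search, which is highly flexible; public opinion analysis, however, often also requires the attributes of the relevant web pages and statistics over them. Because traditional Web retrieval methods cannot meet this need, an associated storage and retrieval system for massive Web data based on multiple attributes is designed and implemented on the Hadoop platform to provide basic retrieval and statistics services for public opinion analysis. The system mainly (1) stores web page data on HDFS classified and clustered by attribute, which addresses the small-file storage problem while improving data access throughput; (2) establishes an association mapping between the original web page data and the attribute data; and (3) builds on the existing indexing mechanism of HBase, combined with a distributed local indexing mechanism, to solve the secondary indexing problem for multi-condition selection queries over dynamic attributes in HBase.

Keywords: classified storage, multi-condition selection query, association mapping, secondary index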

Abstract:

Traditional Web retrieval commonly uses full-text search, which is flexible. However, public opinion analysis usually also needs related web page attributes and statistics, and traditional retrieval methods cannot meet this need well. An associated storage and retrieval system based on the Hadoop platform is designed and implemented to provide basic retrieval and statistics services for public opinion analysis. Firstly, the associated storage of web data on HDFS is realized using machine learning. Secondly, the small-file storage problem is solved while the access efficiency of associated data is improved. Thirdly, the mapping between the original web data and the extracted attributes is established. Finally, retrieval over dynamic multiple attributes is realized by combining the existing indexing mechanism of HBase with distributed local indexing.

Keywords: category storage; multi-condition selection query; association mapping; secondary indexing
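
The idea of a secondary index for multi-condition selection queries, as named in the abstract, can be illustrated with a small in-memory sketch. This is not the authors' HBase implementation: the class `AttributeIndex`, its methods, and the sample row keys and attribute names are all hypothetical, chosen only to show how an index from attribute values to row keys lets a query over several attributes be answered by set intersection instead of a full scan.

```python
from collections import defaultdict


class AttributeIndex:
    """Toy secondary index (hypothetical): (attribute, value) -> set of row keys."""

    def __init__(self):
        # Stands in for the primary table: row key -> attribute dict.
        self.rows = {}
        # Secondary index: (attribute name, value) -> row keys holding that value.
        self.index = defaultdict(set)

    def put(self, row_key, attributes):
        """Insert or overwrite a row and keep the index consistent."""
        # Drop stale index entries if the row already exists.
        for name, value in self.rows.get(row_key, {}).items():
            self.index[(name, value)].discard(row_key)
        self.rows[row_key] = dict(attributes)
        for name, value in attributes.items():
            self.index[(name, value)].add(row_key)

    def query(self, **conditions):
        """Return sorted row keys whose attributes match every condition."""
        if not conditions:
            return sorted(self.rows)
        candidates = [self.index.get((n, v), set()) for n, v in conditions.items()]
        # Intersect starting from the smallest set to keep intermediate results small.
        candidates.sort(key=len)
        result = set(candidates[0])
        for keys in candidates[1:]:
            result &= keys
        return sorted(result)


# Hypothetical sample data: three crawled pages with extracted attributes.
idx = AttributeIndex()
idx.put("page-001", {"site": "example.com", "category": "news", "date": "2013-06-08"})
idx.put("page-002", {"site": "example.com", "category": "blog", "date": "2013-06-08"})
idx.put("page-003", {"site": "example.org", "category": "news", "date": "2013-06-09"})

print(idx.query(site="example.com", date="2013-06-08"))
```

In a distributed setting each storage node would hold an index like this for its own rows only, and a query would be sent to all nodes and the partial results merged; that arrangement is an assumption about what "distributed local indexing" implies, not a detail stated in the abstract.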