• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2013, Vol. 35 ›› Issue (10): 25-35.

Reviewing the big data solution based on Hadoop ecosystem

CHEN Ji-rong, LE Jia-jin

  1. (School of Computer Science and Technology, Donghua University, Shanghai 201620, China)
  • Received: 2013-02-25 Revised: 2013-05-29 Online: 2013-10-25 Published: 2013-10-25
  • Funding:

    National Major Project on Core Electronic Devices, High-end Generic Chips and Basic Software ("He-Gao-Ji") (2010ZX01042-001-003)

Abstract:

A big data solution must address three key problems: storing big data, analyzing it, and managing it. This paper first summarizes the definitions of big data and of the Hadoop ecosystem. It then discusses how to handle big data from two perspectives, commercial products and the Hadoop ecosystem, with the emphasis on how the Hadoop ecosystem addresses each problem: (1) HDFS, HBase and OpenTSDB for storage; (2) Hadoop MapReduce (Hive) and HadoopDB for analysis; and (3) Sqoop, Ganglia and similar tools for management. For each member of the ecosystem, the system architecture, implementation principles and characteristics are analyzed. For the key members, existing problems or shortcomings are also analyzed, and solutions, approaches and viewpoints are proposed, based on a summary of current academic and industrial progress together with the authors' own research. The paper predicts that the Hadoop ecosystem will be the preferred solution for small and medium-sized enterprises facing big data problems.

Key words: big data, Hadoop ecosystem, MapReduce, HDFS, column-oriented database