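• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal
Article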

MapReduce: a New Programming Model for Distributed Parallel Computing

  • (School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China)
Li Chenghua (born 1972), male, from Xiantao, Hubei; postdoctoral researcher; CCF member (E200012566M); research interests: parallel computing and data mining. Zhang Xinfang (born 1965), male, from Wuhua, Guangdong; PhD, professor; research interests: information security, cloud computing, and embedded systems and applications.

Received date: 2009-12-29

  Revised date: 2010-05-04

  Online published: 2011-03-25

Cite this article

Li Chenghua, Zhang Xinfang, Jin Hai, Xiang Wen. MapReduce: a New Programming Model for Distributed Parallel Computing[J]. Computer Engineering & Science, 2011, 33(3): 129-135. DOI: 10.3969/j.issn.1007-130X.2011.

Abstract

MapReduce is a distributed parallel programming model introduced by Google for processing very large datasets in parallel on large clusters of computing nodes. The model is inspired by the map and reduce functions commonly used in functional programming languages. A MapReduce job splits the input dataset into independent chunks, which are processed by Map tasks running in parallel on different machines and producing intermediate files in a defined format. A number of Reduce tasks then merge these intermediate files to produce the final output files. Users can concentrate on writing the Map and Reduce functions. The other hard problems of parallel computing, such as the distributed file system, partitioning the input data, scheduling execution across a set of machines, handling machine failures, and managing inter-machine communication, are handled by the MapReduce runtime system, which greatly reduces the difficulty of programming. MapReduce is increasingly becoming the mainstream programming model on cloud computing platforms. The open-source MapReduce system provided by the Apache Hadoop project still needs further improvement in several respects.
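As a rough illustration of the model described in the abstract, the following single-process Python sketch runs a word-count job through a map phase, a grouping step, and a reduce phase. The word-count task and the names `map_word_count`, `reduce_word_count`, and `run_job` are assumptions chosen for illustration; they are not taken from the paper, from Google's implementation, or from the Hadoop API. A real runtime would run the map calls on different machines and write intermediate files, which this sketch replaces with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(chunk: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in one chunk."""
    for word in chunk.lower().split():
        yield word, 1


def reduce_word_count(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_job(
    chunks: Iterable[str],
    map_fn: Callable[[str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical stand-in for the runtime: map, group by key, reduce."""
    # Map phase: chunks are independent, so a distributed runtime could
    # process each one on a different machine. Here they run sequentially.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for chunk in chunks:
        for key, value in map_fn(chunk):
            # Grouping step: collect values by key (stands in for the
            # intermediate files mentioned in the abstract).
            intermediate[key].append(value)

    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(key, values) for key, values in sorted(intermediate.items()))


chunks = ["the quick brown fox", "the lazy dog", "the fox"]
result = run_job(chunks, map_word_count, reduce_word_count)
print(result)
```

Only the two user-supplied functions know anything about word counting; `run_job` is generic, which mirrors the division of labor the abstract describes between user code and the runtime system.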

