• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2014, Vol. 36 ›› Issue (10): 1860-1865.


Optimizing load balancing of joins in MapReduce      

ZHAI Hongmin,LIU Guohua,ZHAO Wei,LIU Yuanyuan,ZHAI Hongkun   

  (1. School of Computer and Science, Donghua University, Shanghai 201620;
    2. State Grid Corporation of China Heilongjiang Electric Power Company Ltd.,
    Information & Telecommunication Branch, Harbin 150000, China)
  • Received: 2014-06-25 Revised: 2014-08-30 Online: 2014-10-25 Published: 2014-10-04

Abstract:

Data analysis and processing are among the most important tasks in large-scale distributed data processing applications. Owing to its simplicity and scalability, the MapReduce programming model has gradually become a key model for large-scale distributed data processing systems (e.g., Hadoop). Because the data may be non-uniformly distributed, data skew arises when MapReduce joins data sets, which severely degrades join performance. To address data skew, we analyze its causes, establish a load-balancing cost model, and propose a range partitioner algorithm that controls skew and thereby balances load across reducers. Experimental results demonstrate that the method noticeably improves join efficiency.

Key words: MapReduce, join, data skew, range partitioner, load balancing
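
The abstract does not give the details of the proposed algorithm, so the following is only a generic illustration of the idea behind range partitioning for skewed join keys, not the authors' method. It assumes that split points are taken at evenly spaced quantiles of a key sample, so that each reducer receives roughly the same number of records. The function names (`choose_boundaries`, `range_partition`, `reducer_loads`) and the synthetic data are hypothetical.

```python
import bisect
from collections import Counter


def choose_boundaries(sample_keys, num_reducers):
    """Pick num_reducers - 1 split points at evenly spaced quantiles of the sample."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_reducers
    return [ordered[int(i * step)] for i in range(1, num_reducers)]


def range_partition(key, boundaries):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(boundaries, key)


def reducer_loads(keys, partition_fn, num_reducers):
    """Count how many records each reducer would receive."""
    counts = Counter(partition_fn(k) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


def skewed_keys():
    """Synthetic skewed data (an assumption): key k appears 41 - k times, k = 1..40."""
    return [k for k in range(1, 41) for _ in range(41 - k)]


if __name__ == "__main__":
    keys = skewed_keys()
    num_reducers = 4

    # Baseline: equal-width key ranges that ignore the key distribution.
    equal_width = [10, 20, 30]
    print("equal-width loads:",
          reducer_loads(keys, lambda k: bisect.bisect_left(equal_width, k), num_reducers))

    # Quantile-based ranges derived from the data (here the full key list is the sample).
    boundaries = choose_boundaries(keys, num_reducers)
    print("quantile loads:   ",
          reducer_loads(keys, lambda k: range_partition(k, boundaries), num_reducers))
```

In this sketch all records with the same key still go to one reducer, which a join requires, so a single extremely frequent key can still overload its reducer. Handling that case needs an additional technique that the abstract does not describe.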