• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2016, Vol. 38 ›› Issue (02): 231-239.

• Paper

Parallel implementation of LDA algorithm based on Hadoop  

ZHANG Zhao (1,2,3), ZHANG Xinfeng (1,2,3), ZHENG Nan (1,2,3), GUI Mingjun (1,2,3)

  1. College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100124, China
  2. Engineering Research Center of Digital Community, Ministry of Education, Beijing 100124, China
  3. Beijing Laboratory for Urban Mass Transit, Beijing 100124, China
  • Received: 2015-09-13 Revised: 2015-11-15 Online: 2016-02-25 Published: 2016-02-25

Abstract:

With the rapid development of the Internet, the volume of data to be processed keeps growing. Traditional standalone text clustering algorithms cannot meet the requirements of large-scale data processing in data mining. To address the inability of the standalone LDA algorithm to analyze and process large-scale data, we propose a distributed parallel LDA program that uses Gibbs sampling on the MapReduce framework. Using the MapReduce distributed computing framework, we study the distributed implementation of the LDA topic model and test the performance of the resulting program. Comparison tests between Hadoop-based distributed computing and standalone computing show that the method speeds up the algorithm on large-scale data. As the number of cluster nodes increases, the approach also shows good speedup. The parallel implementation of the LDA algorithm is therefore feasible, and it makes it possible to analyze the latent topic information of large-scale data that a standalone LDA model cannot handle.

Key words: Hadoop; MapReduce; LDA topic model; Gibbs; Chinese word segmentation; parallel computing
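
The abstract does not give the details of the parallel scheme, so the following is only a minimal sketch of one common way to split collapsed Gibbs sampling for LDA into map and reduce steps (the approximate distributed approach in which each shard samples against a private copy of the global topic-word counts and the changes are summed afterwards). It may differ from the authors' program. It runs in a single Python process rather than on Hadoop, and all function names, the sharding rule, and the hyperparameter values are assumptions made for illustration.

```python
import random


def init_state(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every token and build the count tables."""
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = assignments[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    return assignments, doc_topic, topic_word, topic_total


def map_shard(doc_ids, docs, assignments, doc_topic, topic_word, topic_total,
              alpha, beta, rng):
    """Map step: one Gibbs sweep over a shard of documents.

    The shard works on a private copy of the global topic-word counts and
    returns only the change (delta) it made to them. Per-document state is
    updated in place here; in a real job it would travel with the record.
    """
    num_topics = len(topic_total)
    vocab_size = len(topic_word[0])
    local_tw = [row[:] for row in topic_word]
    local_tt = topic_total[:]
    for d in doc_ids:
        for i, w in enumerate(docs[d]):
            k = assignments[d][i]
            # Remove the token's current assignment from the counts.
            doc_topic[d][k] -= 1
            local_tw[k][w] -= 1
            local_tt[k] -= 1
            # Unnormalized conditional probability of each topic.
            weights = [
                (doc_topic[d][t] + alpha)
                * (local_tw[t][w] + beta)
                / (local_tt[t] + vocab_size * beta)
                for t in range(num_topics)
            ]
            k = rng.choices(range(num_topics), weights=weights)[0]
            assignments[d][i] = k
            doc_topic[d][k] += 1
            local_tw[k][w] += 1
            local_tt[k] += 1
    return [
        [local_tw[t][w] - topic_word[t][w] for w in range(vocab_size)]
        for t in range(num_topics)
    ]


def reduce_deltas(topic_word, topic_total, deltas):
    """Reduce step: add every shard's delta into the global counts."""
    for delta in deltas:
        for t, row in enumerate(delta):
            for w, change in enumerate(row):
                topic_word[t][w] += change
                topic_total[t] += change


def parallel_lda(docs, num_topics, vocab_size, num_shards=2, iterations=20,
                 alpha=0.1, beta=0.01, seed=0):
    """Run the map/reduce rounds; docs are lists of integer word ids."""
    rng = random.Random(seed)
    assignments, doc_topic, topic_word, topic_total = init_state(
        docs, num_topics, vocab_size, rng)
    # Hypothetical sharding rule: round-robin documents over the shards.
    shards = [list(range(s, len(docs), num_shards)) for s in range(num_shards)]
    for _ in range(iterations):
        deltas = [
            map_shard(shard, docs, assignments, doc_topic, topic_word,
                      topic_total, alpha, beta, rng)
            for shard in shards
        ]
        reduce_deltas(topic_word, topic_total, deltas)
    return doc_topic, topic_word, topic_total
```

Calling `parallel_lda(docs, num_topics=2, vocab_size=4)` on a small list of word-id documents returns the document-topic counts, topic-word counts, and per-topic totals. Because each shard samples against counts that are one round stale, the result approximates sequential Gibbs sampling rather than reproducing it exactly; that trade-off is what lets the map tasks run independently.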