Parallel implementation of LDA algorithm based on Hadoop

J4 ›› 2016, Vol. 38 ›› Issue (02): 231-239.

• Articles •

ZHANG Zhao1,2,3, ZHANG Xinfeng1,2,3, ZHENG Nan1,2,3, GUI Mingjun1,2,3

(1. College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100124;
2. Engineering Research Center of Digital Community, Ministry of Education, Beijing 100124;
3. Beijing Laboratory for Urban Mass Transit, Beijing 100124, China)

Received:2015-09-13 Revised:2015-11-15 Online:2016-02-25 Published:2016-02-25

Abstract

With the rapid development of the Internet, the volume of data to be processed keeps growing, and traditional standalone text clustering algorithms cannot meet the requirements of large-scale data processing in data mining. To address the inability of the standalone LDA algorithm to analyze large-scale data, we propose a distributed parallel LDA program that uses Gibbs sampling and is built on the MapReduce framework. We study the distributed implementation of the LDA topic model on MapReduce and test the performance of the resulting programs. Comparison tests between Hadoop-based distributed computing and standalone computing show that the method increases the running speed of the algorithm on large-scale data, and that it achieves good speedup as the number of cluster nodes increases. The parallel implementation of the LDA algorithm is therefore feasible and makes it possible to analyze the latent topic information of large-scale data, which the standalone LDA model cannot handle.
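The abstract does not spell out how the Gibbs sampler is split across map and reduce tasks, so the following is only a minimal single-process sketch of one commonly used approximate scheme, not the authors' program. It assumes that each map task runs one collapsed Gibbs sweep over its own shard of documents against a copy of the global topic-word counts, and that the reduce step merges the per-shard count changes. All function names, the hyperparameters `alpha` and `beta`, and the toy data are hypothetical.

```python
import random

def init_shard(docs, num_topics, rng):
    """Give every token in a shard a random initial topic."""
    return [[rng.randrange(num_topics) for _ in doc] for doc in docs]

def count_topic_word(docs, assignments, num_topics, vocab_size):
    """Build the topic-word count matrix for one shard."""
    counts = [[0] * vocab_size for _ in range(num_topics)]
    for doc, z in zip(docs, assignments):
        for w, k in zip(doc, z):
            counts[k][w] += 1
    return counts

def map_gibbs_sweep(docs, assignments, global_tw, alpha, beta, rng):
    """Map step (assumed): one Gibbs sweep over a shard, returning count deltas."""
    num_topics, vocab_size = len(global_tw), len(global_tw[0])
    local_tw = [row[:] for row in global_tw]        # private copy of global counts
    topic_totals = [sum(row) for row in local_tw]
    for doc, z in zip(docs, assignments):
        doc_topic = [0] * num_topics
        for k in z:
            doc_topic[k] += 1
        for i, w in enumerate(doc):
            old = z[i]
            # Remove the current token from all counts before resampling it.
            doc_topic[old] -= 1
            local_tw[old][w] -= 1
            topic_totals[old] -= 1
            # Unnormalized collapsed Gibbs conditional for each topic.
            weights = [
                (doc_topic[k] + alpha) * (local_tw[k][w] + beta)
                / (topic_totals[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            z[i] = new
            doc_topic[new] += 1
            local_tw[new][w] += 1
            topic_totals[new] += 1
    return [[local_tw[k][w] - global_tw[k][w] for w in range(vocab_size)]
            for k in range(num_topics)]

def reduce_counts(global_tw, deltas):
    """Reduce step (assumed): add every shard's count changes to the global counts."""
    merged = [row[:] for row in global_tw]
    for delta in deltas:
        for k, row in enumerate(delta):
            for w, d in enumerate(row):
                merged[k][w] += d
    return merged

def run(shards, num_topics, vocab_size, iterations=20, alpha=0.1, beta=0.01, seed=0):
    """Driver: alternate map sweeps and a reduce merge, as a job chain would."""
    rng = random.Random(seed)
    assignments = [init_shard(s, num_topics, rng) for s in shards]
    initial = [count_topic_word(s, z, num_topics, vocab_size)
               for s, z in zip(shards, assignments)]
    global_tw = reduce_counts([[0] * vocab_size for _ in range(num_topics)], initial)
    for _ in range(iterations):
        deltas = [map_gibbs_sweep(s, z, global_tw, alpha, beta, rng)
                  for s, z in zip(shards, assignments)]
        global_tw = reduce_counts(global_tw, deltas)
    return global_tw, assignments

# Toy corpus: two shards of documents, each document a list of word ids.
toy_shards = [[[0, 1, 2, 0], [1, 1, 3]], [[4, 5, 4], [5, 5, 0, 2]]]
topic_word, topic_assignments = run(toy_shards, num_topics=2, vocab_size=6, iterations=5)
```

In this sketch the shards are processed in a plain loop; on a Hadoop cluster each call to `map_gibbs_sweep` would correspond to a map task and `reduce_counts` to the reduce phase of one iteration. Because every shard samples against counts that are only refreshed between iterations, the result approximates, rather than reproduces, a sequential Gibbs sampler.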

Key words: Hadoop; MapReduce; LDA topic model; Gibbs; Chinese word segmentation; parallel computing

ZHANG Zhao1,2,3, ZHANG Xinfeng1,2,3, ZHENG Nan1,2,3, GUI Mingjun1,2,3. Parallel implementation of LDA algorithm based on Hadoop [J]. J4, 2016, 38(02): 231-239.
