• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4, 2016, Vol. 38, Issue (02): 231-239.

• Article

Parallel implementation of LDA algorithm based on Hadoop

ZHANG Zhao1,2,3, ZHANG Xinfeng1,2,3, ZHENG Nan1,2,3, GUI Mingjun1,2,3

  (1. College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100124;
    2. Engineering Research Center of Digital Community, Ministry of Education, Beijing 100124;
    3. Beijing Laboratory for Urban Mass Transit, Beijing 100124, China)
  • Received: 2015-09-13 Revised: 2015-11-15 Online: 2016-02-25 Published: 2016-02-25
  • Funding:

    Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions (CIT&TCD201504018)

Abstract:

With the rapid growth of the Internet, the volume of data to be processed keeps increasing, and traditional standalone text clustering algorithms cannot meet the demands of massive data processing in Internet data mining. To address the inability of the traditional standalone LDA algorithm to analyze large-scale corpora, we propose a method for building a parallel LDA topic model that uses Gibbs sampling on the MapReduce computing framework. We study the parallel implementation of the LDA topic model on MapReduce and evaluate the computational performance of the resulting program. Comparative experiments between Hadoop-based parallel computing and standalone computing show that the method considerably improves the running speed of the algorithm on large-scale corpora, and that it also achieves good speedup as the number of cluster nodes increases. Implementing the LDA algorithm in parallel on the Hadoop platform is therefore feasible, and it solves the problem that a single machine cannot analyze the latent topic information in a large-scale corpus.

Key words: Hadoop; MapReduce; LDA topic model; Gibbs; Chinese word segmentation; parallel computing
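
The abstract does not say how the Gibbs sampler is split across MapReduce tasks, so the following is only a minimal sketch of one common approach, not the authors' implementation. It assumes an approximate distributed scheme: each map task runs one collapsed Gibbs sweep over its own partition of documents against a private copy of the global topic-word counts, and a reduce step sums the count changes from all partitions. Everything runs in a single Python process; the names `map_partition`, `reduce_deltas` and `run`, the hyperparameters `alpha` and `beta`, and the integer word ids are hypothetical and are not taken from the paper or from the Hadoop API.

```python
import random
from collections import Counter


def init_state(docs, num_topics, rng):
    """Assign a random topic to every token of every document."""
    return [[rng.randrange(num_topics) for _ in doc] for doc in docs]


def count_topic_words(docs, assignments, num_topics):
    """Build topic-word counts n_kw and per-topic totals n_k from scratch."""
    n_kw = [Counter() for _ in range(num_topics)]
    n_k = [0] * num_topics
    for doc, z in zip(docs, assignments):
        for w, k in zip(doc, z):
            n_kw[k][w] += 1
            n_k[k] += 1
    return n_kw, n_k


def map_partition(docs, assignments, n_kw, n_k, vocab_size, alpha, beta, rng):
    """Map step (assumed): one Gibbs sweep over a partition.

    Samples against a private copy of the global counts and returns the new
    assignments plus the net change (delta) in topic-word counts.
    """
    num_topics = len(n_k)
    local_kw = [Counter(c) for c in n_kw]
    local_k = list(n_k)
    delta = Counter()
    new_assignments = []
    for doc, z in zip(docs, assignments):
        z = list(z)
        n_dk = [0] * num_topics  # document-topic counts for this document
        for k in z:
            n_dk[k] += 1
        for i, w in enumerate(doc):
            old = z[i]
            # Remove the current token from all counts before resampling.
            n_dk[old] -= 1
            local_kw[old][w] -= 1
            local_k[old] -= 1
            # Collapsed Gibbs conditional, up to a normalizing constant.
            weights = [
                (n_dk[k] + alpha) * (local_kw[k][w] + beta)
                / (local_k[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            new = rng.choices(range(num_topics), weights=weights)[0]
            z[i] = new
            n_dk[new] += 1
            local_kw[new][w] += 1
            local_k[new] += 1
            if new != old:
                delta[(old, w)] -= 1
                delta[(new, w)] += 1
        new_assignments.append(z)
    return new_assignments, delta


def reduce_deltas(n_kw, n_k, deltas):
    """Reduce step (assumed): add every partition's delta to the global counts."""
    for delta in deltas:
        for (k, w), d in delta.items():
            n_kw[k][w] += d
            n_k[k] += d
    return n_kw, n_k


def run(partitions, num_topics, vocab_size, iterations=20,
        alpha=0.1, beta=0.01, seed=0):
    """Alternate map and reduce steps; each iteration mimics one MapReduce job."""
    rng = random.Random(seed)
    state = [init_state(p, num_topics, rng) for p in partitions]
    all_docs = [d for p in partitions for d in p]
    all_z = [z for s in state for z in s]
    n_kw, n_k = count_topic_words(all_docs, all_z, num_topics)
    for _ in range(iterations):
        results = [
            map_partition(p, s, n_kw, n_k, vocab_size, alpha, beta, rng)
            for p, s in zip(partitions, state)
        ]
        state = [r[0] for r in results]
        n_kw, n_k = reduce_deltas(n_kw, n_k, [r[1] for r in results])
    return state, n_kw, n_k
```

In this sketch each partition stands in for the input split of one map task, and one pass of the loop in `run` corresponds to one job. Because mappers do not see each other's updates within a sweep, the sampler is an approximation of sequential Gibbs sampling; the merged counts after each reduce step are nevertheless exact for the new assignments.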