• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Papers

A parallel MRACO-PAM clustering algorithm based on MapReduce

ZHAO Bao-wen, XU Hua

  1. (School of IoT Engineering, Jiangnan University, Wuxi 214122, Jiangsu, China)
  • Received: 2015-11-06 Revised: 2016-09-21 Online: 2017-10-25 Published: 2017-10-25
  • Supported by:

    Natural Science Foundation of Jiangsu Province (BK20140165); China Scholarship Council sponsored project (201308320030)

Abstract:

Cluster analysis is a commonly used data-processing method, and partitioning around medoids (PAM) has been one of the most widely used clustering algorithms since it was proposed in 1990. PAM overcomes the sensitivity of the K-Means algorithm to dirty data (outliers) during clustering. However, the traditional PAM algorithm converges slowly and, because of its time complexity, handles large datasets inefficiently. To address these problems, we use an ant colony search mechanism to strengthen the global search and local exploration capabilities of PAM, and we propose the MRACO-PAM algorithm, which parallelizes the computation on the MapReduce programming framework. Experimental results show that the parallel MRACO-PAM clustering algorithm based on MapReduce improves convergence speed, is capable of processing large-scale data, and scales well.

Key words: MapReduce, ant colony optimization (ACO), partitioning around medoids (PAM), big data, parallel computing
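
The abstract does not give the details of MRACO-PAM, so the following is only a minimal sketch of how a medoid-based clustering round might be split into a map step and a reduce step. Several things here are assumptions and not taken from the paper: the function names (`map_assign`, `reduce_update`, `pam_iteration`, `run_pam`) are hypothetical, the shuffle phase is simulated with an in-memory dictionary instead of a real MapReduce cluster, the update rule is the simpler alternating k-medoids update and not PAM's swap phase, and the ant colony search is omitted entirely.

```python
from collections import defaultdict
from math import dist


def map_assign(point, medoids):
    """Map step (assumed): emit (index of the nearest medoid, point)."""
    nearest = min(range(len(medoids)), key=lambda i: dist(point, medoids[i]))
    return nearest, point


def reduce_update(points):
    """Reduce step (assumed): pick the member with the smallest total
    distance to the other members of its cluster as the new medoid."""
    return min(points, key=lambda c: sum(dist(c, p) for p in points))


def pam_iteration(data, medoids):
    """One assignment-and-update round; a dict stands in for the shuffle."""
    groups = defaultdict(list)
    for point in data:
        key, value = map_assign(point, medoids)
        groups[key].append(value)
    # An empty cluster keeps its previous medoid.
    return [reduce_update(groups[i]) if groups[i] else medoids[i]
            for i in range(len(medoids))]


def run_pam(data, medoids, max_rounds=20):
    """Repeat rounds until the medoids stop changing or max_rounds is hit."""
    for _ in range(max_rounds):
        updated = pam_iteration(data, medoids)
        if updated == medoids:
            break
        medoids = updated
    return medoids
```

Because a medoid is always an actual data point chosen to minimize total distance, a single extreme value cannot drag the cluster center toward itself the way it drags a mean. For example, calling `run_pam` on two small groups of points plus one far-away outlier leaves the outlier assigned to a cluster but never selected as its medoid.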