• Official journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


  • Supported by: the Chongqing Basic and Frontier Research Program (csts2014jcyjA40001, cstc2014jcyjA40022) and the Science and Technology Research Project of the Chongqing Municipal Education Commission (Natural Science) (KJ1400436)

Parallel multi-label K-nearest neighbor algorithm based on Spark

WANG Jin, XIA Cuiping, OUYANG Weihua, WANG Hong, DENG Xin, CHEN Qiaosong

  1. (Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)
  • Received: 2016-09-07  Revised: 2016-11-03  Online: 2017-02-25  Published: 2017-02-25


Abstract:

With the advent of the big data era, large-scale multi-label data mining has attracted extensive attention. The Multi-Label K-Nearest Neighbor (MLKNN) algorithm is a simple, efficient and widely used multi-label classification method whose accuracy surpasses that of other common multi-label learning algorithms in many real-world applications. However, as the scale of the data to be processed grows, the traditional serial MLKNN algorithm can no longer meet the time and memory constraints of big data applications. Combining Spark's parallel mechanism with its in-memory iterative computation, we propose SMLKNN, an MLKNN algorithm based on the Spark parallel framework. In the map stage, the K nearest neighbors of each test sample are found within each partition; in the reduce stage, the final K nearest neighbors are determined from the per-partition neighbor sets; finally, the label sets of the neighbors are aggregated in parallel, and the target label set of the test sample is output according to the maximum a posteriori (MAP) probability principle. Comparative experiments in standalone and cluster environments show that, while preserving classification accuracy, the performance of SMLKNN scales approximately linearly with computing resources, which improves the ability of MLKNN to handle large-scale multi-label data.

Key words: multi-label learning, MLKNN, Spark, parallel
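
The map/reduce neighbor search and label aggregation described in the abstract can be sketched as follows. This is an illustrative pure-Python mock-up of the data flow only, not the authors' Spark implementation: partitions are modeled as plain lists, and a simple neighbor-frequency threshold stands in for the full ML-KNN posterior-probability rule. All function names are hypothetical.

```python
import heapq
from collections import Counter

def partition_knn(partition, query, k):
    # Map stage: within one data partition, find the k training samples
    # closest to `query` (Euclidean distance). Each sample is a pair
    # (feature_vector, label_set).
    dists = [(sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, labels)
             for x, labels in partition]
    return heapq.nsmallest(k, dists, key=lambda t: t[0])

def merge_knn(partial_results, k):
    # Reduce stage: merge the per-partition candidate lists and keep the
    # k globally nearest neighbors.
    merged = [item for part in partial_results for item in part]
    return heapq.nsmallest(k, merged, key=lambda t: t[0])

def predict_labels(neighbors, threshold=0.5):
    # Label aggregation: count how often each label occurs among the k
    # neighbors and output labels whose frequency exceeds `threshold`
    # (a simplified stand-in for the MAP decision rule).
    k = len(neighbors)
    counts = Counter(l for _, labels in neighbors for l in labels)
    return {l for l, c in counts.items() if c / k > threshold}
```

In a real Spark job, `partition_knn` would run inside `mapPartitions` over the training RDD and `merge_knn` inside a `reduce`, so only k candidates per partition, rather than the full training set, are shuffled to the driver.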