• A journal of the China Computer Federation (CCF)
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


Parallel multi-label K-nearest neighbor algorithm based on Spark

WANG Jin,XIA Cuiping,OUYANG Weihua,WANG Hong,DENG Xin,CHEN Qiaosong   

  1. (Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)
  • Received: 2016-09-07  Revised: 2016-11-03  Online: 2017-02-25  Published: 2017-02-25

Abstract:

With the advent of the big data era, applications of large-scale multi-label data mining have attracted extensive attention. The Multi-Label K-Nearest Neighbor (MLKNN) algorithm is a simple, efficient and widely used method that outperforms many traditional multi-label learning algorithms in real-world applications. However, as the volume of data to be processed grows, MLKNN can no longer meet the requirements on running time and memory space. Exploiting the parallel execution and in-memory iterative computation of Spark, we propose SMLKNN, a parallel MLKNN algorithm built on the Spark distributed in-memory computing platform. First, in the map stage, the K nearest neighbors of each test sample are searched for within each data partition. Then, in the reduce stage, the final K nearest neighbors are determined from the per-partition candidates. Finally, the label sets of the K nearest neighbors are aggregated in parallel, and the target label set is output according to the maximum a posteriori (MAP) principle. Experiments in both standalone and cluster environments show that, while preserving classification accuracy, the performance of SMLKNN scales approximately linearly with computing resources, and the proposed algorithm improves the ability of MLKNN to handle large-scale multi-label data.
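
As a rough illustration of the map/reduce scheme described above, the following Scala/Spark sketch finds the global K nearest neighbors of a single broadcast test sample by selecting top-K candidates within each partition (map stage) and merging them into a global top-K list (reduce stage), then counting the labels among those neighbors. All class and function names (LabeledPoint, mergeTopK, etc.) are illustrative assumptions, not the authors' SMLKNN implementation, and the full MAP decision rule, which compares these counts against prior and conditional probabilities estimated from the training set, is omitted for brevity.

    import org.apache.spark.sql.SparkSession

    // Illustrative sketch only: names below are assumptions, not the authors' code.
    object SMLKNNSketch {

      // A training instance: a feature vector plus its multi-label set.
      case class LabeledPoint(features: Array[Double], labels: Set[String])

      def squaredDistance(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      // Reduce stage: merge two locally sorted top-K candidate lists into one.
      def mergeTopK(k: Int)(xs: Seq[(Double, LabeledPoint)],
                            ys: Seq[(Double, LabeledPoint)]): Seq[(Double, LabeledPoint)] =
        (xs ++ ys).sortBy(_._1).take(k)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SMLKNN-sketch").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val k = 3
        // Toy training data; real experiments would load a large multi-label dataset.
        val train = sc.parallelize(Seq(
          LabeledPoint(Array(0.0, 0.1), Set("sports")),
          LabeledPoint(Array(0.2, 0.0), Set("sports", "news")),
          LabeledPoint(Array(5.0, 5.1), Set("music")),
          LabeledPoint(Array(5.2, 4.9), Set("music", "art"))
        ), numSlices = 2)

        // One test sample, broadcast to all partitions.
        val query = sc.broadcast(Array(0.1, 0.05))

        // Map stage: find the top-K neighbors inside each partition independently.
        val localTopK = train.mapPartitions { it =>
          val candidates: Seq[(Double, LabeledPoint)] =
            it.map(p => (squaredDistance(query.value, p.features), p))
              .toVector.sortBy(_._1).take(k)
          Iterator(candidates)
        }

        // Reduce stage: combine per-partition candidates into the global K neighbors.
        val neighbors = localTopK.reduce(mergeTopK(k) _)

        // Count how often each label occurs among the K neighbors; the full MAP rule
        // would weigh these counts against probabilities estimated from training data.
        val labelCounts = neighbors.flatMap(_._2.labels)
          .groupBy(identity).map { case (lab, occ) => lab -> occ.size }
        println(s"Labels among the $k nearest neighbors: $labelCounts")

        spark.stop()
      }
    }

In a real deployment the training set would itself be a large RDD and many test samples would be broadcast or joined in batches, so the per-partition search and the merge step are the parts that benefit from adding executors, which is consistent with the near-linear scaling reported in the abstract.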
 

Key words: multi-label learning, MLKNN, Spark, parallel