• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (02): 372-380.

• Artificial Intelligence and Data Mining •

Optimization of the Canopy bisecting K-Means algorithm based on density and central index

SHEN Guo-xin (沈郭鑫)1, JIANG Zhong-yun (蒋中云)2

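  (1. College of Information, Shanghai Ocean University, Shanghai 201306; 2. College of Information Technology, Shanghai Jian Qiao University, Shanghai 201306)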
  • Received: 2020-05-26 Revised: 2020-09-21 Accepted: 2022-02-25 Online: 2022-02-25 Published: 2022-02-18
  • Funding:
    Fund for Applied Undergraduate Pilot Programs of Shanghai Municipal Universities (Z32004-17-84)

A Canopy bisecting K-Means algorithm based on density and central index

SHEN Guo-xin1, JIANG Zhong-yun2

  (1. College of Information, Shanghai Ocean University, Shanghai 201306; 2. College of Information, Shanghai Jian Qiao University, Shanghai 201306, China)

  • Received: 2020-05-26 Revised: 2020-09-21 Accepted: 2022-02-25 Online: 2022-02-25 Published: 2022-02-18

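Abstract: The bisecting K-means algorithm gives unstable clustering results because its initial centers are selected at random and its number of clusters is defined manually. To address this, a Canopy bisecting K-means algorithm based on density and a central index, SDC_Bisecting K-Means, is proposed. First, the density of the data in the sample and its neighborhood radius are calculated. Then, the data point with the smallest density is selected and clustered following the idea of the Canopy algorithm, and the resulting number of clusters and their centers are used as the input parameters of the bisecting K-means algorithm. Finally, an exponential function and a central index are introduced on top of the bisecting K-means algorithm to cluster the original samples. Comparative simulation experiments on UCI datasets and a self-built dataset show that SDC_Bisecting K-Means not only yields more accurate clustering results but also runs faster and is more stable.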

Keywords: clustering, bisecting K-means algorithm, density, neighborhood radius, exponential function, central index

Abstract: To address the unstable clustering results that the bisecting K-means algorithm produces when its initial centers are selected at random and its number of clusters is defined manually, a Canopy bisecting K-means algorithm based on density and a central index is proposed. First, the algorithm calculates the density of the data in the sample and its neighborhood radius. Second, it selects the data point with the smallest density and clusters the data following the idea of the Canopy algorithm. The resulting number of clusters and the cluster centers serve as the input parameters of the bisecting K-means algorithm. Finally, building on the bisecting K-means algorithm, an exponential function and a central index are introduced to cluster the original samples. Comparative simulation experiments were run on UCI datasets and a self-built dataset. The results show that the algorithm produces more accurate clustering results, runs faster, and is more stable.


Key words: clustering, bisecting K-Means algorithm, density, neighborhood radius, exponential function, central index
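
The two-stage pipeline outlined in the abstract (a Canopy pass that estimates the number of clusters, followed by bisecting K-means) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SDC_Bisecting K-Means algorithm itself: the abstract does not give the density, neighborhood-radius, exponential-function, or central-index formulas, so none of them appear here. The function names, the thresholds `t1` and `t2` (with `t1 >= t2 >= 0`), the input-order seed selection, and the farthest-point initialization of each split are all hypothetical choices made for the example, and only the cluster count from the Canopy pass is fed to the second stage.

```python
import math


def mean(cluster):
    """Component-wise mean of a non-empty list of points (tuples)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    m = mean(cluster)
    return sum(math.dist(p, m) ** 2 for p in cluster)


def canopy_centers(points, t1, t2):
    """Classic Canopy pass with loose threshold t1 and tight threshold t2.

    Assumption: seeds are taken in input order, not by the paper's density rule.
    """
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        members = [p for p in points if math.dist(seed, p) <= t1]
        centers.append(mean(members))
        # Points within t2 of the seed can no longer start a new canopy.
        remaining = [p for p in remaining if math.dist(seed, p) > t2]
    return centers


def two_means(cluster, iters=20):
    """Split one cluster in two with plain 2-means.

    Assumption: deterministic start from the first point and the point
    farthest from it, instead of a random start.
    """
    a = cluster[0]
    b = max(cluster, key=lambda p: math.dist(p, a))
    split = None
    for _ in range(iters):
        left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
        if not left or not right:
            break
        split = (left, right)
        new_a, new_b = mean(left), mean(right)
        if (new_a, new_b) == (a, b):
            break  # centers stopped moving
        a, b = new_a, new_b
    return split


def bisecting_kmeans(points, k):
    """Repeatedly split the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        if sse(worst) == 0:
            break  # nothing left that can be split
        split = two_means(worst)
        if split is None:
            break
        clusters.remove(worst)
        clusters.extend(split)
    return clusters


# Toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (20, 0), (20, 1), (21, 0), (21, 1)]
centers = canopy_centers(points, t1=4, t2=2)
clusters = bisecting_kmeans(points, k=len(centers))
print(len(centers), [len(c) for c in clusters])  # prints: 3 [4, 4, 4]
```

Because neither stage draws random numbers, repeated runs on the same input return the same partition, which is the stability property the abstract is concerned with. The thresholds still have to be chosen by hand in this sketch, whereas the paper derives its neighborhood radius from the data.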