• Official journal of the China Computer Federation (CCF)
• Chinese Science and Technology Core Journal
• Chinese Core Journal

Computer Engineering & Science (计算机工程与科学), 2021, Vol. 43, Issue (10): 1838-1847.

• Artificial Intelligence and Data Mining •

A self-adaptive clustering algorithm without neighborhood parameter k and cluster number c

ZHANG Bo-kai1, YANG De-gang1,2, FENG Ji1,2

  (1. College of Computer and Information Science, Chongqing Normal University, Chongqing 401331; 2. Chongqing Engineering Research Center of Educational Big Data Intelligent Perception and Application, Chongqing 401331, China)
  • Received: 2020-08-06 Revised: 2020-11-23 Accepted: 2021-10-25 Online: 2021-10-25 Published: 2021-10-22
  • About author: ZHANG Bo-kai, born in 1996, male, from Weifang, Shandong, MS candidate, CCF member (E9765G); his research interest is data analysis.
  • Supported by:
    Humanities and Social Sciences Research Project of the Ministry of Education (18XJC880002, 20YJAZH084); Science and Technology Research Project of the Chongqing Municipal Education Commission (KJQN201800539); Chongqing Basic Science and Frontier Technology Project (cstc2016jcyjA0419)

Abstract: Traditional clustering methods usually require the user to choose neighborhood parameters and the number of clusters. The optimal values of these parameters differ across data sets of different shapes, and finding a suitable range for them requires extensive prior knowledge. To address this parameter selection problem, this paper proposes a border peeling clustering algorithm based on natural neighbors (NaN-BP), which obtains satisfactory clustering results without setting the neighborhood parameter or the number of clusters. The core idea of the algorithm is to first iterate adaptively, according to the distribution characteristics of the data set, until a logarithmic stable state is reached and neighborhood information is obtained; then to mark and peel off the boundary points according to this neighborhood information; and finally to cluster with the core points as the centers of the data clusters. Extensive comparative experiments are conducted on data sets of different scales and distributions, and the results verify the adaptability and effectiveness of NaN-BP.

Key words: clustering analysis, self-adaptive, natural neighbor, logarithmic stable state, core point
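
The three steps named in the abstract (adaptive neighborhood search, border peeling, clustering around core points) can be illustrated with a minimal sketch. It is not the authors' NaN-BP implementation. The abstract does not define the logarithmic stable state, so the sketch swaps in a simpler stopping rule from natural neighbor search: the neighborhood size r grows until the number of points that no other point has chosen as a neighbor stops changing. The border rule (fewer reverse neighbors than the average, which equals r), the single peeling pass, and the function names `natural_neighbor_search` and `peel_and_cluster` are likewise assumptions made for illustration.

```python
import math


def natural_neighbor_search(points):
    """Grow r until the count of never-chosen points stops changing.

    Assumed stopping rule; the paper's logarithmic stable state may differ.
    """
    n = len(points)
    # For every point, the indices of all other points sorted by distance.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: math.dist(points[i], points[j]))
        for i in range(n)
    ]
    neighbors = [set() for _ in range(n)]
    reverse_count = [0] * n  # how often each point was chosen as a neighbor
    prev_orphans = None
    r = 0
    for r in range(1, n):
        for i in range(n):
            j = order[i][r - 1]  # r-th nearest neighbor of point i
            neighbors[i].add(j)
            reverse_count[j] += 1
        orphans = sum(1 for c in reverse_count if c == 0)
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans
    return r, neighbors, reverse_count


def peel_and_cluster(points, r, neighbors, reverse_count):
    """Peel border points once, group core points, then reattach the border."""
    n = len(points)
    # Hypothetical border rule: chosen less often than average (the mean is r).
    is_border = [c < r for c in reverse_count]
    labels = [-1] * n
    next_label = 0
    for seed in range(n):
        if is_border[seed] or labels[seed] != -1:
            continue
        # Flood-fill one connected component of the core-point neighbor graph.
        labels[seed] = next_label
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(n):
                linked = j in neighbors[i] or i in neighbors[j]
                if linked and not is_border[j] and labels[j] == -1:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    core = [i for i in range(n) if not is_border[i]]
    for i in range(n):
        if is_border[i]:
            # A peeled point takes the label of its nearest core point.
            nearest = min(core, key=lambda j: math.dist(points[i], points[j]))
            labels[i] = labels[nearest]
    return labels


# Toy data: two well-separated 4x4 grids, no k and no cluster count supplied.
blob_a = [(x, y) for x in range(4) for y in range(4)]
blob_b = [(x + 20, y) for x in range(4) for y in range(4)]
points = blob_a + blob_b
r, neighbors, reverse_count = natural_neighbor_search(points)
labels = peel_and_cluster(points, r, neighbors, reverse_count)
```

Neither function takes a neighborhood size or a cluster count as input: r comes out of the search loop, and the number of clusters is the number of connected components among the core points. The all-pairs distance sort keeps the sketch short but scales poorly, so a spatial index would be the natural replacement on larger data sets.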