• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2024, Vol. 46 ›› Issue (02): 292-302.

• 人工智能与数据挖掘 • 上一篇    下一篇

基于模型的非凸聚类算法

钟卓辉1,2,陈黎飞1,2,3   

  1. (1.福建师范大学计算机与网络空间安全学院,福建  福州  350117;
    2.数字福建环境监测物联网实验室,福建  福州  350117;3.福建省应用数学中心,福建  福州  350117)

  • 收稿日期:2022-06-28 修回日期:2022-09-22 接受日期:2024-02-25 出版日期:2024-02-25 发布日期:2024-02-24
  • 基金资助:
    国家自然科学基金(U1805263,61976053)

A model-based non-convex clustering algorithm

ZHONG Zhuo-hui1,2,CHEN Li-fei1,2,3   

  1. (1.College of Computer and Cyber Security,Fujian Normal University,Fuzhou 350117;
    2.Digital Fujian Internet-of-Things Laboratory of Environment Monitoring,Fuzhou 350117;
    3.Center for Applied Mathematics of Fujian Province,Fuzhou 350117,China)
  • Received:2022-06-28 Revised:2022-09-22 Accepted:2024-02-25 Online:2024-02-25 Published:2024-02-24

摘要: 由于数据可能分布在非规则的流形上,其中潜在的簇往往呈现非凸的形状和结构,针对这类数据的聚类问题被统称为非凸聚类。现有的主流非凸聚类方法包括基于原始空间的方法和基于空间变换的方法,均忽略了非凸数据模式的显式描述。提出一种描述性模型用于非凸聚类。首先,基于核密度方法定义了一种具有混合形式的特征加权核密度模型,其无需事先假定任何概率分布模型且不限制簇的形状,这是传统基于模型的聚类方法无法实现的。其次,基于提出的模型推导了聚类目标函数,并基于期望最大化算法提出一种求解密度函数局部区域密度极大值的优化算法,那些上升到密度函数相同密度极大值的样本点被划分为同一个簇。最后,定义了一种基于模型的非凸聚类算法。算法不需人为定义簇的数量,并且能够为每个簇分配一个显式的概率密度函数,有助于更稳健和更准确地表征集群。除此之外,算法不仅在优化过程中进行自适应带宽选择,而且在优化过程中赋予了样本空间特征权重,实现了嵌入式特征选择。

关键词: 非凸聚类, 描述性模型, 基于模型的聚类, 特征选择, 核密度估计, 局部密度极大值

Abstract: Since data may be distributed on an irregular manifold, where the underlying clusters often exhibit non-convex shapes and structures, the clustering problems for such data are collectively referred to as non-convex clustering. However, existing mainstream non-convex clustering methods, including clustering based on original space and clustering based on Space Transformation, ignore the explicit description of non-convex data patterns and fail to understand and describe the underlying mechanisms that produce such structures. Therefore, a descriptive clustering model is proposed to act on non-convex clustering. Firstly, a feature-weighted kernel density model with a hybrid form is defined based on the kernel density method that does not need to assume any probability distribution model in advance and does not restrict the shape of clusters, which cannot be achieved by traditional model-based cluster- ing methods. Secondly, the clustering objective function is derived based on the proposed model, and an optimization algorithm for solving the local density maximum of the density function is proposed based on the expectation maximization algorithm. The clusters are divided into those samples that rise to the same density maximum of the density function. Finally, a model-based non-convex clustering algorithm is defined. The algorithm does not need to manually define the number of clusters, and can assign an explicit probability density function to each cluster, which helps to characterize clusters more robustly and accurately. In addition, the algorithm not only performs adaptive bandwidth selection, but also gives the feature weight of the sample space, enabling the automatic embedded feature selection during the clustering process.

Key words: non-convex clustering, descriptive model, model-based clustering, feature selection, kernel density estimation, local density maximum