基于模型的非凸聚类算法

计算机工程与科学 ›› 2024, Vol. 46 ›› Issue (02): 292-302.

基于模型的非凸聚类算法

钟卓辉1,2，陈黎飞1,2,3

（1.福建师范大学计算机与网络空间安全学院，福建福州 350117；
2.数字福建环境监测物联网实验室，福建福州 350117；3.福建省应用数学中心，福建福州 350117）

收稿日期:2022-06-28 修回日期:2022-09-22 接受日期:2024-02-25 出版日期:2024-02-25 发布日期:2024-02-24
基金资助:
国家自然科学基金（U1805263,61976053）

A model-based non-convex clustering algorithm

ZHONG Zhuo-hui1,2，CHEN Li-fei1,2,3

(1.College of Computer and Cyber Security,Fujian Normal University,Fuzhou 350117;
2.Digital Fujian Internet-of-Things Laboratory of Environment Monitoring,Fuzhou 350117;
3.Center for Applied Mathematics of Fujian Province,Fuzhou 350117,China)

Received:2022-06-28 Revised:2022-09-22 Accepted:2024-02-25 Online:2024-02-25 Published:2024-02-24

摘要/Abstract

摘要： 由于数据可能分布在非规则的流形上，其中潜在的簇往往呈现非凸的形状和结构，针对这类数据的聚类问题被统称为非凸聚类。现有的主流非凸聚类方法包括基于原始空间的方法和基于空间变换的方法，均忽略了非凸数据模式的显式描述。提出一种描述性模型用于非凸聚类。首先，基于核密度方法定义了一种具有混合形式的特征加权核密度模型，其无需事先假定任何概率分布模型且不限制簇的形状，这是传统基于模型的聚类方法无法实现的。其次，基于提出的模型推导了聚类目标函数，并基于期望最大化算法提出一种求解密度函数局部区域密度极大值的优化算法，那些上升到密度函数相同密度极大值的样本点被划分为同一个簇。最后，定义了一种基于模型的非凸聚类算法。算法不需人为定义簇的数量，并且能够为每个簇分配一个显式的概率密度函数，有助于更稳健和更准确地表征集群。除此之外，算法不仅在优化过程中进行自适应带宽选择，而且在优化过程中赋予了样本空间特征权重，实现了嵌入式特征选择。

关键词: 非凸聚类, 描述性模型, 基于模型的聚类, 特征选择, 核密度估计, 局部密度极大值

Abstract: Since data may be distributed on an irregular manifold, where the underlying clusters often exhibit non-convex shapes and structures, the clustering problems for such data are collectively referred to as non-convex clustering. However, existing mainstream non-convex clustering methods, including clustering based on original space and clustering based on Space Transformation, ignore the explicit description of non-convex data patterns and fail to understand and describe the underlying mechanisms that produce such structures. Therefore, a descriptive clustering model is proposed to act on non-convex clustering. Firstly, a feature-weighted kernel density model with a hybrid form is defined based on the kernel density method that does not need to assume any probability distribution model in advance and does not restrict the shape of clusters, which cannot be achieved by traditional model-based cluster- ing methods. Secondly, the clustering objective function is derived based on the proposed model, and an optimization algorithm for solving the local density maximum of the density function is proposed based on the expectation maximization algorithm. The clusters are divided into those samples that rise to the same density maximum of the density function. Finally, a model-based non-convex clustering algorithm is defined. The algorithm does not need to manually define the number of clusters, and can assign an explicit probability density function to each cluster, which helps to characterize clusters more robustly and accurately. In addition, the algorithm not only performs adaptive bandwidth selection, but also gives the feature weight of the sample space, enabling the automatic embedded feature selection during the clustering process.

Key words: non-convex clustering, descriptive model, model-based clustering, feature selection, kernel density estimation, local density maximum

钟卓辉, 陈黎飞, . 基于模型的非凸聚类算法[J]. 计算机工程与科学, 2024, 46(02): 292-302.

ZHONG Zhuo-hui, CHEN Li-fei, . A model-based non-convex clustering algorithm[J]. Computer Engineering & Science, 2024, 46(02): 292-302.

[1]	武培成, 赵旭俊, 靳黎忠. 基于网格密度积叠的流数据异常检测[J]. 计算机工程与科学, 2025, 47(01): 75-85.
[2]	刘振超, 苑迎春, 王克俭, 何晨. 融合特征权重与改进粒子群优化的特征选择算法[J]. 计算机工程与科学, 2024, 46(02): 282-291.
[3]	赵瑞平, 降爱莲. 基于自编码器和局部嵌入的无监督特征选择[J]. 计算机工程与科学, 2023, 45(07): 1282-1291.
[4]	顾楚梅, 曹建军, 王保卫, 徐雨芯, . 基于蚁群参数优化的LightGBM辐射源个体识别[J]. 计算机工程与科学, 2023, 45(01): 85-94.
[5]	文武, 万玉辉, 文志云, . 基于正余弦算法的文本特征选择[J]. 计算机工程与科学, 2022, 44(08): 1467-1473.
[6]	刘云, 肖添, 王梓宇. 动态特征选择算法对恶意行为检测的优化研究[J]. 计算机工程与科学, 2022, 44(04): 665-673.
[7]	吴尚智, 徐丹丹, 王旭文, 夏宁. 基于广义重要度和runner-root算法的特征选择[J]. 计算机工程与科学, 2022, 44(04): 723-729.
[8]	李雨晨, 魏巍, 白伟明, 王达. 基于标签共现关系的多标签特征选择[J]. 计算机工程与科学, 2021, 43(11): 2049-2055.
[9]	文武, 万玉辉, 张许红, 文志云, . 基于改进CHI和PCA的文本特征选择[J]. 计算机工程与科学, 2021, 43(09): 1645-1652.
[10]	李向军1,2，孔珂2，魏智翔1，王科选1，肖聚鑫1. 面向Android恶意应用静态检测的特征频数差异增强算法[J]. 计算机工程与科学, 2020, 42(06): 993-1002.
[11]	孟昱煜，陈绍立，刘兴长. 面向排序学习的层次聚类特征选择算法[J]. 计算机工程与科学, 2019, 41(12): 2211-2216.
[12]	张守宾，朱习军. 集成学习算法在中医证型分类预测中的应用[J]. 计算机工程与科学, 2019, 41(02): 328-334.
[13]	纪明君，刘漫丹，才乐千. 基于半监督LDA特征子空间优化的人脸识别算法[J]. 计算机工程与科学, 2018, 40(10): 1851-1857.
[14]	吴帅，赵方. 基于随机森林的老年人居住偏好预测研究[J]. 计算机工程与科学, 2018, 40(05): 924-930.
[15]	杨超，蔡晓东，甘凯今，王丽娟. 基于自适应显著特征选择的动态加权平均行人识别模型[J]. 计算机工程与科学, 2017, 39(05): 935-943.