• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2022, Vol. 44 ›› Issue (08): 1402-1408.

• 计算机网络与信息安全 • 上一篇    下一篇

基于差分隐私保护的Stacking集成聚类算法研究

李帅1,2,常锦才1,2,李吕牧之1,2,蔡昆杰1,2   

  1. (1.华北理工大学理学院,河北 唐山 063210;2.河北省数据科学与应用重点实验室,河北 唐山  063210)
  • 收稿日期:2021-07-21 修回日期:2021-09-17 接受日期:2022-08-25 出版日期:2022-08-25 发布日期:2022-08-25
  • 基金资助:
    华北空管局科技项目(HBKG202002);唐山市科学与工程计算创新团队项目(18130209B)

A Stacking ensemble clustering algorithm based on differential privacy protection

LI Shuai1,2,CHANG Jin-cai1,2,LI-L Mu-zhi1,2,CAI Kun-jie1,2   

  1. (1.College of Science,North China University of Science and Technology,Tangshan 063210;
    2.Hebei Provincial Key Laboratory of Data Science and Application,Tangshan 063210,China)
  • Received:2021-07-21 Revised:2021-09-17 Accepted:2022-08-25 Online:2022-08-25 Published:2022-08-25

摘要: 针对差分隐私保护下单一聚类算法准确性和安全性不足的问题,提出了一种基于差分隐私保护的Stacking集成聚类算法。使用Stacking集成多种异质聚类算法,将K-means聚类、Birch层次聚类、谱聚类和混合高斯聚类作为初级聚类算法,结合轮廓系数对初级聚类算法产生的聚类结果加权并入原始数据,将K-means算法作为次级聚类算法对扩展后的数据集进行聚类分析。其中,针对原始数据和初级聚类算法的聚类结果分别提出自适应的ε函数确定隐私预算,为不同敏感度的数据分配不同程度的Laplace噪声。理论分析和实验结果均表明,与单一聚类算法相比,该算法满足ε-差分隐私保护的同时有效提高了聚类准确性,实现了隐私保护与数据可用性的高度平衡。

关键词: 差分隐私, 集成聚类, Stacking算法, 自适应ε, 隐私保护

Abstract: Aiming at the problem that the accuracy and security of the single clustering algorithm under differential privacy protection are insufficient, a stacking ensemble clustering algorithm based on differential privacy protection is proposed. Stacking is used to integrate a variety of heterogeneous clustering algorithms. K-means clustering, birch hierarchical clustering, spectral clustering and gaussian mixture clustering are used as primary clustering algorithms. By combining the contour coefficient, the clustering results generated by the primary clustering algorithms are weighted into the original data. K-means algorithm is used as the secondary clustering algorithm to cluster the expanded data set. According to the clustering results of the original data and the primary clustering algorithms, adaptive ε  functions are proposed to determine the privacy budget, and different degrees of Laplace noise are allocated to the data with different sensitivities. Theoretical analysis and experimental results show that, compared with the single clustering algorithm, the proposed algorithm can effectively improve the clustering accuracy while satisfying the  ε-differential privacy protection, and achieve a high balance between privacy protection and data availability.

Key words: differential privacy, ensemble clustering, Stacking algorithm, self-adaption ε, privacy protection