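• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)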

• High Performance Computing

An interpolation based outlier detection method of sparse high-dimensional data

CHEN Wang-hu, TIAN Zhen, ZHANG Li-zhi, LIANG Xiao-yan, GAO Ya-qiong

  1. (College of Computer Science & Engineering, Northwest Normal University, Lanzhou 730070, China)
  • Received: 2019-04-22 Revised: 2019-12-11 Online: 2020-06-25 Published: 2020-06-25
  • Supported by:

    National Natural Science Foundation of China (61967013, 61462076)

Abstract:

In outlier detection, the data can be viewed as a dense mixture of normal and abnormal points in a space. If the loss of normal points is to be kept small, the outliers are usually contained in the set of samples farthest from the cluster centroids. Inspired by this idea, this paper proposes an interpolation-based outlier detection method for sparse high-dimensional data. Building on k-means clustering, the method applies a genetic algorithm to interpolate the original data, which addresses the tendency of sparse data to be merged during k-means clustering. Experimental results show that, compared with the outlier detection method based on traditional k-means clustering and with several typical detection methods based on improved k-means, the proposed method loses fewer normal points and improves the accuracy and precision of detection.

Key words: sparse data, outlier detection, interpolation, clustering, genetic algorithm
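
The observation the abstract starts from, that outliers tend to lie among the samples farthest from the cluster centroids, can be illustrated with a minimal sketch of the k-means baseline the paper compares against. This is not the authors' method: the genetic-algorithm interpolation step is not specified in the abstract and is omitted here. The function names `kmeans` and `farthest_points`, the parameter `n_outliers`, the fixed initial centroids, and the toy data are all assumptions made for illustration.

```python
import math


def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, init_centroids, iters=50):
    """Plain Lloyd's k-means from given initial centroids (hypothetical helper)."""
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    for _ in range(iters):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids


def farthest_points(points, centroids, n_outliers):
    """Indices of the n_outliers points farthest from their nearest centroid."""
    scores = [min(dist(p, c) for c in centroids) for p in points]
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_outliers])


# Toy data (assumed): two tight groups plus one isolated point at index 8.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (2.5, 9.0),
]
centroids = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
print(farthest_points(points, centroids, 1))  # [8]
```

With these initial centroids the isolated point is absorbed into the second cluster and pulls that cluster's centroid toward itself. The abstract identifies this merging of sparse points as the weakness of plain k-means that the interpolation step is meant to address.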