An interpolation based outlier detection method of sparse high-dimensional data

Computer Engineering & Science

CHEN Wang-hu, TIAN Zhen, ZHANG Li-zhi, LIANG Xiao-yan, GAO Ya-qiong

(College of Computer Science & Engineering, Northwest Normal University, Lanzhou 730070, Gansu, China)

Received: 2019-04-22  Revised: 2019-12-11  Online: 2020-06-25  Published: 2020-06-25

Funding: National Natural Science Foundation of China (61967013, 61462076)

Abstract:

The data in an outlier detection problem can be viewed as a dense mixture of normal and abnormal points in space. If the loss of normal points is to be kept small, the outliers are usually contained in the sample set farthest from the cluster centroids. Inspired by this idea, this paper proposes an interpolation-based outlier detection method for sparse high-dimensional data. Building on k-means, the method applies a genetic algorithm to interpolate the original data, which solves the problem that sparse data are easily merged during k-means clustering. Experimental results show that, compared with the outlier detection method based on traditional k-means clustering and several typical detection methods based on improved k-means, the proposed method loses fewer normal points and improves the accuracy and precision of detection.

Key words: sparse data, outlier detection, interpolation, clustering, genetic algorithm
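
The premise stated in the abstract, that outliers tend to lie among the samples farthest from the cluster centroids, can be illustrated with a small sketch. It shows only plain k-means distance scoring, i.e. the kind of baseline the abstract compares against. It does not include the genetic-algorithm interpolation step, whose details the abstract does not give. The function names `kmeans` and `farthest_points`, the explicit initial centroids, and the toy data are assumptions made for this example.

```python
import math


def kmeans(points, init_centroids, iters=50):
    """Plain Lloyd's k-means starting from caller-supplied centroids (an assumption
    made here so the example is deterministic). Returns the final centroids."""
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids


def farthest_points(points, centroids, n_outliers):
    """Indices of the n_outliers points farthest from their nearest centroid."""
    score = [min(math.dist(p, c) for c in centroids) for p in points]
    ranked = sorted(range(len(points)), key=lambda i: score[i], reverse=True)
    return ranked[:n_outliers]


# Toy data: two tight groups plus one isolated point at index 8.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 20)]
cents = kmeans(pts, init_centroids=[(0, 0), (10, 10)])
print(farthest_points(pts, cents, n_outliers=1))  # -> [8]
```

In this toy run the isolated point is absorbed into the second cluster and pulls its centroid away from the dense group, a small-scale picture of how sparse points can be merged by k-means.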

CHEN Wang-hu, TIAN Zhen, ZHANG Li-zhi, LIANG Xiao-yan, GAO Ya-qiong. An interpolation based outlier detection method of sparse high-dimensional data[J]. Computer Engineering & Science.
