• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• High Performance Computing

Research on an anomaly detection method based on big data

杨先圣1,姜磊1,彭雄2,周倩1,刘菊君1   

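  1. Key Laboratory of Knowledge Processing and Networked Manufacturing (Hunan Provincial Key Laboratory for Colleges and Universities),
     Hunan University of Science and Technology, Xiangtan, Hunan 411201;
  2. BBK Commercial Chain Co., Ltd., Xiangtan, Hunan 411000
  • Received: 2018-01-16  Revised: 2018-04-01  Online: 2018-07-25  Published: 2018-07-25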

A new outlier detection method based on large data

YANG Xiansheng1,JIANG Lei1,PENG Xiong2,ZHOU Qian1,LIU Jujun1   

  1. Key Laboratory of Knowledge Processing & Network Manufacturing,
     Hunan University of Science & Technology, Xiangtan 411201, China;
  2. BBK Commercial Chain Co., Ltd., Xiangtan 411000, China
  • Received: 2018-01-16  Revised: 2018-04-01  Online: 2018-07-25  Published: 2018-07-25

Abstract (translated from the Chinese):

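Outlier detection aims to find anomalous data in massive datasets. It offers two benefits: first, as a data preprocessing step, it reduces the influence of noise points on a model; second, detecting anomalies in a specific scenario and mining the anomalous phenomena themselves is also highly valuable. Current mainstream methods in China and abroad, such as LOF, KNN and ORCA, cannot handle complex scenarios in which global outliers, local outliers and outlier clusters coexist.
To address this, a new outlier detection model is proposed. To detect global outliers, local outliers and outlier clusters as comprehensively as possible, iForest, LOF and DBSCAN are selected as base classifiers because they are highly sensitive to global outliers, local outliers and outlier clusters, respectively. Their objective functions are changed, the error-rate calculation of the framework is revised, and the three are fused into a new outlier detection model, ILD-BOOST. Experimental results show that the model accounts for global outliers, local outliers and outlier clusters together, and that it outperforms current mainstream outlier detection methods.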
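
The idea of pairing three detectors with complementary strengths can be illustrated with a minimal, self-contained sketch. It is not the ILD-BOOST model: the abstract does not specify the modified objective functions or the revised error-rate calculation, so a plain union vote stands in for the boosting step. The pure-Python `iforest_scores`, `lof_scores` and `dbscan_labels` are simplified stand-ins for iForest, LOF and DBSCAN. All function names, the parameters (`k`, `eps`, `min_pts`, `min_cluster`, `n_top`, the LOF cutoff of 1.5) and the synthetic data are assumptions made for illustration.

```python
import math
import random

def _c(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 2:
        return float(max(n - 1, 0))
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def _grow(S, depth, limit, rng):
    """Isolate points with random axis-aligned cuts; a leaf stores its size."""
    if depth >= limit or len(S) <= 1:
        return len(S)
    dim = rng.randrange(len(S[0]))
    lo, hi = min(p[dim] for p in S), max(p[dim] for p in S)
    if lo == hi:  # simplification: stop if the chosen feature is constant
        return len(S)
    cut = rng.uniform(lo, hi)
    left = [p for p in S if p[dim] < cut]
    right = [p for p in S if p[dim] >= cut]
    return (dim, cut, _grow(left, depth + 1, limit, rng),
            _grow(right, depth + 1, limit, rng))

def _path(x, node, depth=0):
    if isinstance(node, int):
        return depth + _c(node)
    dim, cut, left, right = node
    return _path(x, left if x[dim] < cut else right, depth + 1)

def iforest_scores(X, n_trees=100, psi=64, seed=0):
    """Isolation-forest score in (0, 1]; easy-to-isolate points score high."""
    rng = random.Random(seed)
    psi = min(psi, len(X))  # assumes at least two points
    limit = math.ceil(math.log2(psi))
    trees = [_grow(rng.sample(X, psi), 0, limit, rng) for _ in range(n_trees)]
    return [2.0 ** (-sum(_path(x, t) for t in trees) / (n_trees * _c(psi)))
            for x in X]

def lof_scores(X, k=3):
    """Local outlier factor; values well above 1 mean locally sparse."""
    n = len(X)
    D = [[math.dist(a, b) for b in X] for a in X]
    nn = [sorted((j for j in range(n) if j != i), key=D[i].__getitem__)[:k]
          for i in range(n)]
    kdist = [D[i][nn[i][-1]] for i in range(n)]
    lrd = [k / max(sum(max(kdist[j], D[i][j]) for j in nn[i]), 1e-12)
           for i in range(n)]
    return [sum(lrd[j] for j in nn[i]) / (k * lrd[i]) for i in range(n)]

def dbscan_labels(X, eps, min_pts):
    """Plain DBSCAN; returns a cluster id per point, -1 for noise."""
    n = len(X)
    near = [[j for j in range(n) if math.dist(X[i], X[j]) <= eps]
            for i in range(n)]
    labels, cluster = [None] * n, -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(near[i]) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        stack = list(near[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(near[j]) >= min_pts:
                stack.extend(near[j])
    return labels

def vote(X, k=3, eps=1.5, min_pts=4, min_cluster=10, n_top=4):
    """Number of detectors (0 to 3) that flag each point."""
    s_if, s_lof = iforest_scores(X), lof_scores(X, k)
    labels = dbscan_labels(X, eps, min_pts)
    top = set(sorted(range(len(X)), key=s_if.__getitem__, reverse=True)[:n_top])
    return [(i in top) + (s_lof[i] > 1.5)
            + (labels[i] == -1 or labels.count(labels[i]) < min_cluster)
            for i in range(len(X))]

# Hypothetical data: a dense and a sparse cluster plus three kinds of anomaly.
tight = [(i * 0.1, j * 0.1) for i in range(6) for j in range(6)]
loose = [(10.0 + i, float(j)) for i in range(6) for j in range(6)]
local_pt, global_pt = (0.25, 1.0), (30.0, 30.0)
small = [(0.0, 20.0), (0.2, 20.0), (0.0, 20.2), (0.2, 20.2), (0.1, 20.1)]
X = tight + loose + [local_pt, global_pt] + small
flagged = [i for i, v in enumerate(vote(X)) if v >= 1]
print(flagged)
```

On this toy data, LOF is the detector that catches the point sitting just outside the dense cluster, while the small-cluster rule on the DBSCAN labels catches the five-point group that LOF with `k=3` treats as locally normal.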

Keywords: outlier detection, model fusion, commercial big data, boosting framework

Abstract:

Outlier detection, which aims to find abnormal data in massive datasets, has two advantages. First, as a data preprocessing step, it reduces the impact of noise on the model. Second, in a specific scenario, it can find outliers accurately and support analysis of the abnormal phenomena themselves. At present, mainstream methods in China and abroad, such as KNN and ORCA, do not jointly account for global outliers, local outliers and outlier clusters, and they have difficulty handling large-scale data sets. We propose a new outlier detection model based on the Spark platform. To maximize overall detection performance, iForest, LOF and DBSCAN are chosen for their high sensitivity to global outliers, local outliers and outlier clusters, respectively. First, these three base classifiers are selected and their objective functions are changed. Then, the error-rate calculation of the boosting framework is modified and the classifiers are merged to form a new outlier detection model called ILD-BOOST. The results show that the model fully accounts for global outliers, local outliers and outlier clusters, improves overall precision and recall, and performs clearly better than current mainstream outlier detection methods.
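
For readers unfamiliar with the two metrics named above, the following is a small sketch of how precision and recall are usually computed for an outlier detector. The function name `precision_recall` and the example index sets are hypothetical and do not come from the paper's experiments.

```python
def precision_recall(true_outliers, flagged):
    """Precision and recall of a set of flagged points against ground truth."""
    true_outliers, flagged = set(true_outliers), set(flagged)
    hits = len(true_outliers & flagged)  # correctly flagged outliers
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_outliers) if true_outliers else 0.0
    return precision, recall

# Hypothetical example: 4 true outliers, 3 points flagged, 2 of them correct.
p, r = precision_recall([3, 7, 9, 12], [3, 7, 20])
print(p, r)
```

Precision falls when normal points are flagged, and recall falls when true outliers are missed, so a detector that handles only one kind of outlier tends to lose recall on the others.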
 

Key words: outlier detection, blending and stacking, business big data, boosting framework