  • Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


A new outlier detection method based on large data

YANG Xiansheng1,JIANG Lei1,PENG Xiong2,ZHOU Qian1,LIU Jujun1   

  (1. Key Laboratory of Knowledge Processing & Network Manufacturing,
    Hunan University of Science & Technology, Xiangtan 411201;
    2. BBK Commercial Chain Co., Ltd., Xiangtan 411000, China)
     
     
  • Received: 2018-01-16  Revised: 2018-04-01  Online: 2018-07-25  Published: 2018-07-25

Abstract:

Outlier detection, which aims to find abnormal records in massive data sets, serves two purposes. First, as a data preprocessing step, it reduces the impact of noise on a model. Second, in a specific application scenario, it can locate outliers accurately so that the abnormal phenomenon behind them can be analyzed. Current mainstream methods, such as KNN and ORCA, do not jointly consider global outliers, local outliers, and outlier clusters, and they have difficulty handling large-scale data sets. We propose a new outlier detection model built on the Spark platform. To maximize overall detection performance, iForest, LOF, and DBSCAN are chosen as base classifiers for their high sensitivity to global outliers, local outliers, and outlier clusters, respectively. First, these three base classifiers are selected and their objective functions are changed. Then the error-rate calculation of the boosting framework is modified, and the classifiers are merged into a new outlier detection model called ILDBOOST. The results show that the model accounts for global outliers, local outliers, and outlier clusters together, which improves overall precision and recall, and that it clearly outperforms current mainstream outlier detection methods.
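
The abstract does not give the modified objective functions or the modified error-rate formula, so the following is only a minimal sketch of the general idea of combining three detectors by error-rate weighting. It assumes each detector has already produced a 0/1 outlier flag per point, and it uses the standard AdaBoost weight 0.5 · ln((1 − err) / err) with an unweighted error rate, which is not necessarily what ILDBOOST does. The function names, the toy flags, and the vote threshold are hypothetical.

```python
import math


def detector_weight(flags, truth, eps=1e-9):
    """AdaBoost-style weight 0.5 * ln((1 - err) / err) from an unweighted error rate."""
    errors = sum(f != t for f, t in zip(flags, truth))
    err = min(max(errors / len(truth), eps), 1 - eps)  # clamp to avoid log(0)
    return 0.5 * math.log((1 - err) / err)


def weighted_vote(all_flags, weights, threshold):
    """Flag a point when the weighted share of detectors flagging it reaches threshold."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = []
    for point_flags in zip(*all_flags):
        share = sum(w for f, w in zip(point_flags, weights) if f) / total
        combined.append(1 if share >= threshold else 0)
    return combined


def precision_recall(flags, truth):
    """Precision and recall of 0/1 outlier flags against 0/1 ground truth."""
    tp = sum(f == 1 and t == 1 for f, t in zip(flags, truth))
    fp = sum(f == 1 and t == 0 for f, t in zip(flags, truth))
    fn = sum(f == 0 and t == 1 for f, t in zip(flags, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy labels: the last three points are true outliers.
truth = [0, 0, 0, 0, 0, 1, 1, 1]
# Hypothetical flags standing in for a global, a local, and a cluster-oriented detector.
global_flags = [0, 0, 0, 0, 0, 1, 0, 0]
local_flags = [0, 0, 0, 0, 0, 0, 1, 0]
cluster_flags = [1, 0, 0, 0, 0, 0, 0, 1]

detectors = [global_flags, local_flags, cluster_flags]
weights = [detector_weight(f, truth) for f in detectors]
combined = weighted_vote(detectors, weights, threshold=0.15)  # threshold is an assumption
print(combined, precision_recall(combined, truth))
```

In this toy setup each detector catches a different kind of outlier, so a low vote threshold favors recall while a higher one favors precision; the threshold would need tuning on labeled data.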
 

Key words: outlier detection, blending and stacking, business big data, boosting framework