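• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)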


Parallel anomaly detection based on Isolation Forest

HOU Yongxu1, DUAN Lei1,2, QIN Jianglong3, QIN Pan1, TANG Changjie1

  (1. School of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China;
   2. West China School of Public Health, Sichuan University, Chengdu, Sichuan 610041, China;
   3. School of Software, Yunnan University, Kunming, Yunnan 650091, China)
  • Received: 2016-09-11 Revised: 2016-11-05 Online: 2017-02-25 Published: 2017-02-25
  • Supported by:

    National Natural Science Foundation of China (61572332, 61379032); China Postdoctoral Science Foundation Special Grant (2016T90850); Fundamental Research Funds for the Central Universities (2016SCU04A22)

Abstract:

Anomaly detection is used in a wide variety of applications and has attracted attention from both industry and academia. Among the many methods for anomaly detection, the Isolation Forest algorithm combines high execution efficiency with good detection accuracy and has been widely applied in practice. However, the conventional Isolation Forest algorithm can hardly handle large-scale data sets. To overcome this limitation, we propose an algorithm built on a cloud computing platform. Specifically, we design and implement PIFH, a parallel anomaly detection algorithm based on Isolation Forest, using the Hadoop distributed storage system and the MapReduce distributed computing framework. By parallelizing both the construction of the detection model and the anomaly evaluation of the data, PIFH improves the execution efficiency of anomaly detection and extends its range of application. Experiments on real-world data sets verify the efficiency and scalability of the proposed algorithm.
 

Key words: anomaly detection, cloud computing, parallelization
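
The abstract names two stages that PIFH parallelizes: detection model construction and anomaly evaluation. The sketch below illustrates why both stages split naturally into independent tasks. It is not the PIFH implementation, whose details the abstract does not give. It is a single-machine Python sketch of the standard Isolation Forest scoring rule, in which a thread pool stands in for MapReduce map tasks and no Hadoop API is used. All function names and parameters (`map_build_tree`, `map_score`, `detect`, `n_trees`, `sample_size`, `workers`) are hypothetical.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

EULER_GAMMA = 0.5772156649


def avg_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively partition points with random axis-parallel splits."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # a constant attribute cannot be split
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(tree, x):
    """Depth at which x reaches a leaf, plus c(size) for the unsplit remainder."""
    depth = 0
    while "dim" in tree:
        tree = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
        depth += 1
    return depth + avg_path_length(tree["size"])


def map_build_tree(task):
    """Map-style task (model construction): one tree from one random subsample."""
    data, sample_size, seed = task
    rng = random.Random(seed)  # per-task seed keeps tasks independent
    sample = rng.sample(data, sample_size)
    max_depth = math.ceil(math.log2(sample_size))
    return build_tree(sample, 0, max_depth, rng)


def map_score(task):
    """Map-style task (anomaly evaluation): score one point against the forest."""
    x, forest, sample_size = task
    mean_depth = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / avg_path_length(sample_size))


def detect(data, n_trees=100, sample_size=64, workers=4, seed=0):
    """Return one anomaly score in (0, 1] per point; higher means more anomalous."""
    if len(data) < 2:
        raise ValueError("need at least two points")
    sample_size = min(sample_size, len(data))
    # Assumption: the thread pool only stands in for distributed map tasks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        build_tasks = [(data, sample_size, seed + i) for i in range(n_trees)]
        forest = list(pool.map(map_build_tree, build_tasks))
        score_tasks = [(x, forest, sample_size) for x in data]
        return list(pool.map(map_score, score_tasks))


if __name__ == "__main__":
    rng = random.Random(42)
    points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
    points.append((10.0, 10.0))  # an obviously isolated point
    scores = detect(points)
    print("score of isolated point:", scores[-1])
```

Each tree depends only on its own subsample and seed, and each point's score depends only on the finished forest, so neither stage needs communication between tasks. This independence is what makes a MapReduce formulation plausible. How PIFH actually partitions the data and distributes the forest is not stated in the abstract.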