• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• Artificial Intelligence and Data Mining

SMS scam user identification based on SPARK and random forest (基于SPARK与随机森林的短信诈骗用户识别研究)

YANG Jiechao (杨杰超), XU Jiangchun (许江淳), YUE Qiuyan (岳秋燕), ZENG Debin (曾德斌), LU Wanrong (陆万荣)

  1. (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
  • Received: 2018-08-13 Revised: 2018-12-24 Online: 2019-06-25 Published: 2019-06-25

Abstract:

SMS scams are becoming more common in today's data era. To identify SMS scam users before they commit fraud, and in light of current telecom industry needs and the state of research, we propose a weighted random forest algorithm with hierarchical subspaces, implemented on the SPARK parallel processing framework. The wide variety of SMS users produces imbalanced data classes, which degrades the performance of the random forest. To address this, we adopt an improved hierarchical subspace method and weight each decision tree according to its evaluated classification ability. The improved random forest outperforms the other classification algorithms compared. Because telecom industry data is massive, we choose distributed SPARK as the data processing platform; its parallelism shortens model training and testing time and improves efficiency. The method identifies telecom SMS scam users accurately and in real time, with an accuracy above 90%.

Key words: SPARK, random forest, hierarchical subspace, weighted, SMS scam user identification
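
The tree-weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how each tree's classification ability is measured or how the weights enter the vote, so the function name `weighted_vote`, the example weights, and the rule of summing weights per class are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-tree class predictions using per-tree weights.

    predictions: one class label per tree.
    weights: one non-negative score per tree (hypothetical, e.g. each
             tree's accuracy on held-out data).
    Returns the label whose trees have the largest total weight.
    """
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have the same length")
    if not predictions:
        raise ValueError("at least one tree prediction is required")

    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight

    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(totals), key=lambda label: totals[label])


# Three trees: one strong tree says "normal", two weak trees say "scam".
tree_predictions = ["normal", "scam", "scam"]
tree_weights = [0.9, 0.3, 0.4]  # assumed per-tree scores
print(weighted_vote(tree_predictions, tree_weights))
```

With equal weights the same function reduces to ordinary majority voting, so the weights only change the outcome when trees disagree and differ in estimated quality.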