An improved Bloom Filter algorithm under Hadoop for duplicated web page removal

Computer Engineering & Science

HUANG Wei-jian, YANG Hai-long

(School of Information and Electrical Engineering, Hebei University of Engineering, Handan 056038, China)

Received: 2015-09-10 Revised: 2015-11-10 Online: 2017-02-25 Published: 2017-02-25
Funding:
Natural Science Foundation of Hebei Province (F2015402077); Key Basic Research Project of Hebei Province (14964206D)

Abstract:

To reduce the storage space wasted on servers by large amounts of duplicated and similar data, we propose an improved Bloom Filter algorithm that adds a bit array and dynamically optimizes the number of replicas of duplicated data according to a weight calculated from the repeated hit counts of the bit array. The improved algorithm is then implemented in parallel on a Hadoop distributed cluster to further improve job processing efficiency. Experimental results show that, compared with traditional web page deduplication algorithms, the parallel implementation of the improved Bloom Filter algorithm not only improves job processing efficiency but also saves server storage space to a certain extent by optimizing the number of replicas based on the dynamic repeat counts in the bit array.

Key words: Hadoop, Bloom Filter, number of replicas, MapReduce
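
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the hash functions, the weight formula, or the replica rule, so the class `HitCountingBloomFilter`, the salted SHA-256 hashing, and the function `replica_count` (including its normalization by `total_repeats` and its default bounds) are all assumptions made only for illustration. The sketch keeps a standard bit array plus a second array that counts repeated hits, then maps the repeat count to a replica number through a hypothetical weight.

```python
import hashlib


class HitCountingBloomFilter:
    """Bloom filter with an extra array of repeated-hit counters (illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # standard Bloom filter bit array
        self.hits = [0] * num_bits      # added array: repeated hits per position

    def _positions(self, item):
        # Assumed hashing scheme: salt SHA-256 with the hash index.
        positions = set()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, item):
        """Insert item and return True if it was probably seen before."""
        positions = self._positions(item)
        seen = all(self.bits[p] for p in positions)
        for p in positions:
            if seen:
                self.hits[p] += 1  # count only repeated (duplicate) hits
            self.bits[p] = True
        return seen

    def repeat_count(self, item):
        """Estimate how many times item was re-added after its first insert."""
        return min(self.hits[p] for p in self._positions(item))


def replica_count(repeats, total_repeats, min_replicas=1, max_replicas=3):
    """Hypothetical rule: scale replicas by the item's share of all repeats."""
    if total_repeats == 0:
        return min_replicas
    weight = repeats / total_repeats
    return min_replicas + round(weight * (max_replicas - min_replicas))
```

Taking the minimum counter over an item's positions mirrors the usual counting Bloom filter estimate: collisions can only inflate a counter, so the smallest one is the tightest bound. In a MapReduce job, one plausible arrangement would be to run such a filter inside each mapper or reducer over its own partition of URLs or page fingerprints, but the abstract does not say how the paper splits the work.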

HUANG Wei-jian, YANG Hai-long. An improved Bloom Filter algorithm under Hadoop for duplicated web page removal[J]. Computer Engineering & Science.
