• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


An improved Bloom Filter algorithm on
Hadoop for duplicate web page removal

HUANG Wei-jian, YANG Hai-long

  1. (School of Information and Electrical Engineering, Hebei University of Engineering, Handan 056038, China)
  • Received: 2015-09-10  Revised: 2015-11-10  Online: 2017-02-25  Published: 2017-02-25

Abstract:

Servers that store large amounts of duplicated and similar data waste storage space. To address this problem, we propose an improved Bloom Filter algorithm that adds a bit array and dynamically optimizes the number of copies of duplicated data according to a weight calculated from the repeated hits on that bit array. The improved algorithm is then parallelized on a Hadoop distributed cluster to further improve processing efficiency. Experimental results show that, compared with traditional web page duplicate removal algorithms, the improved Bloom Filter algorithm not only improves job processing efficiency but also saves server storage space to a certain extent.

Key words: Hadoop, Bloom Filter, number of copies, MapReduce
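
The idea summarized in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the authors' implementation: the abstract does not give the hash functions, the weight formula, or the rule that maps a weight to a number of copies, so the class `HitCountingBloomFilter`, the parallel `hits` array, the `weight` method, and the `replica_count` function (including its direction, more hits giving more copies, and its thresholds) are all assumptions chosen for illustration. The Hadoop/MapReduce parallelization is not shown.

```python
import hashlib


class HitCountingBloomFilter:
    """Bloom filter plus a parallel array of repeated-hit counts (illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability
        self.hits = [0] * num_bits       # assumed extra array: repeated hits per slot

    def _positions(self, item):
        # Derive up to k distinct slots by salting SHA-256 with the hash index.
        slots = set()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            slots.add(int.from_bytes(digest[:8], "big") % self.num_bits)
        return slots

    def add(self, item):
        """Insert item; return True if it was probably seen before."""
        slots = self._positions(item)
        seen = all(self.bits[p] for p in slots)
        for p in slots:
            if seen:
                self.hits[p] += 1  # record a repeated hit on this slot
            self.bits[p] = 1
        return seen

    def weight(self, item):
        """Hypothetical weight: the smallest hit count among the item's slots."""
        return min(self.hits[p] for p in self._positions(item))


def replica_count(weight, base=1, max_replicas=3, step=2):
    """Hypothetical rule: one extra copy per `step` repeated hits, capped."""
    return min(max_replicas, base + weight // step)


if __name__ == "__main__":
    bf = HitCountingBloomFilter()
    for url in ["http://a.example/", "http://b.example/", "http://a.example/"]:
        print(url, "duplicate" if bf.add(url) else "new")
    print("copies for a:", replica_count(bf.weight("http://a.example/")))
```

As with any Bloom filter, `add` can report a false positive once the bit array fills up, so hit counts and the derived weights are upper-bound estimates rather than exact duplicate counts.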