• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (04): 699-706.

• Artificial Intelligence and Data Mining •

SNM algorithm optimization based on field filtering and scaling window

ZHOU Shi-jie, LOU Yuan-sheng

  1. (College of Computer and Information, Hohai University, Nanjing 211100, China)
  • Received: 2020-12-18 Revised: 2021-01-26 Accepted: 2022-04-25 Online: 2022-04-25 Published: 2022-04-20

Abstract: Problematic data in a data warehouse has a great impact on data quality. Finding and deleting such data starts with the processing of similar duplicate records. The most widely used deduplication algorithm at present is the sorted-neighborhood method (SNM). After analyzing the shortcomings of this algorithm, an improved SNM algorithm (ISNM) is proposed. Attribute weights are calculated with the attribute discrimination method, which removes the subjectivity of manually assigned weights. A field filtering algorithm is used when calculating the similarity of two records, which reduces the number of attribute comparisons within the window and speeds up detection. Variable-size windows replace the fixed-size window to avoid missing duplicate records and to reduce useless record comparisons. Experimental results show that the ISNM algorithm has clear advantages in recall, precision, and running time.
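
For orientation, the baseline SNM pass and one possible reading of field filtering can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the ISNM details, so the early-exit rule in `record_similarity`, the use of `difflib.SequenceMatcher` as the field metric, and the names `field_similarity`, `record_similarity`, `snm`, `window`, and `threshold` are all hypothetical. The weights are passed in by hand here (the attribute discrimination method is not reproduced), and the window is fixed-size, so the variable-window idea is not shown.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; SequenceMatcher is a stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights, threshold):
    """Weighted record similarity with an early exit.

    Assumed form of field filtering: fields are visited in descending
    weight order, and comparison stops as soon as the score could no
    longer reach `threshold` even if all remaining fields matched exactly.
    Returns the score, or None if the pair was filtered out.
    """
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    score = 0.0
    remaining = sum(weights)
    for i in order:
        score += weights[i] * field_similarity(r1[i], r2[i])
        remaining -= weights[i]
        if score + remaining < threshold:
            return None  # filtered: the remaining fields are never compared
    return score


def snm(records, key, weights, window=3, threshold=0.85):
    """Baseline sorted-neighborhood pass with a fixed-size window."""
    # Sort record indices by the sorting key so likely duplicates are adjacent.
    ordered = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(ordered):
        # Compare each record only with the previous (window - 1) records.
        for j in ordered[max(0, pos - window + 1):pos]:
            if record_similarity(records[i], records[j], weights, threshold) is not None:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)
```

Called on a list of tuples such as `("John Smith", "12 Main Street", "Nanjing")` with `key=lambda r: r[0].lower()`, `snm` returns the index pairs judged to be similar duplicates. The fixed `window` is the weakness the abstract refers to: too small a value misses duplicates that sort far apart, while too large a value wastes comparisons.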


Key words: data quality, data cleaning, similar duplicate records, SNM algorithm