• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2022, Vol. 44 ›› Issue (04): 699-706.

• Artificial Intelligence and Data Mining •

SNM algorithm optimization based on field filtering and scaling window

ZHOU Shi-jie (周世杰), LOU Yuan-sheng (娄渊胜)

  1. (College of Computer and Information, Hohai University, Nanjing 211100, China)
  • Received: 2020-12-18 Revised: 2021-01-26 Accepted: 2022-04-25 Online: 2022-04-25 Published: 2022-04-20
  • Supported by:
    Key Research and Development Program of Jiangsu Province (BE2018301)

Abstract: Problematic data in a data warehouse significantly degrades data quality. To find and remove such data, the first task is to handle similar duplicate records. The most widely used algorithm for duplicate elimination is currently the basic sorted-neighborhood method (SNM). After analyzing the shortcomings of SNM, this paper proposes an improved algorithm, ISNM. Attribute weights are calculated with the attribute discrimination method, which avoids the problems caused by subjective, manually assigned weights. A field filtering algorithm is used to compute the similarity of two records, which reduces the number of attribute comparisons between records in the window and speeds up detection. A variable-size window replaces the fixed-size window, which prevents missed matches and reduces unnecessary record comparisons. Experimental results show that ISNM has clear advantages in recall, precision, and running time.

Key words: data quality, data cleaning, similar duplicate records, SNM algorithm
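
The three ideas named in the abstract (data-derived attribute weights, field filtering, and a variable window) can be illustrated with a minimal Python sketch. It is not the authors' ISNM implementation: the abstract gives no formulas, so every concrete rule here is an assumption. Specifically, `distinct_ratio_weights` weights a field by its share of distinct values, `filtered_similarity` stops comparing fields once the threshold can no longer be reached, and `isnm_sketch` grows the window by one whenever the record at the window edge matches. All function names, the `threshold`, `w_min`, and `w_max` parameters, and the use of `difflib.SequenceMatcher` as the field similarity are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def distinct_ratio_weights(records):
    """Assumed weighting rule: fields with more distinct values weigh more."""
    n_fields = len(records[0])
    distinct = [len({r[i] for r in records}) for i in range(n_fields)]
    total = sum(distinct)
    return [d / total for d in distinct]


def filtered_similarity(r1, r2, weights, threshold):
    """Weighted field similarity with early exit (assumed filtering rule).

    Fields are compared in descending weight order. If the score so far plus
    the weight of all uncompared fields is below the threshold, the pair
    cannot be a duplicate and the remaining fields are skipped.
    Returns the score, or None when the pair is filtered out.
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    score, remaining = 0.0, sum(weights)
    for i in order:
        score += weights[i] * SequenceMatcher(None, r1[i], r2[i]).ratio()
        remaining -= weights[i]
        if score + remaining < threshold:
            return None
    return score


def isnm_sketch(records, key, weights, threshold=0.85, w_min=2, w_max=10):
    """Sorted-neighborhood pass whose window grows while matches continue.

    A window of size w covers the anchor record and the w - 1 records that
    follow it in sort order. Returns a set of (i, j) index pairs, i < j,
    referring to positions in the original `records` list.
    """
    ordered = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(ordered):
        window, offset = w_min, 1
        while offset < window and pos + offset < len(ordered):
            j = ordered[pos + offset]
            score = filtered_similarity(records[i], records[j], weights, threshold)
            if score is not None:
                pairs.add((min(i, j), max(i, j)))
                # Assumed growth rule: a match at the window edge extends it.
                if offset == window - 1 and window < w_max:
                    window += 1
            offset += 1
    return pairs
```

A call such as `isnm_sketch(records, key=lambda r: r[0], weights=distinct_ratio_weights(records), threshold=0.8)` sorts the records on the first field and returns candidate duplicate pairs. Setting `w_max` equal to `w_min` turns off window growth, which gives a fixed-window baseline for comparison.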