• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2010, Vol. 32 ›› Issue (8): 101-103.doi: 10.3969/j.issn.1007130X.2010.

• Papers •

On the Improvement of a Copy Detection Algorithm for Chinese Documents

孙伟,邢长征

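  1. (Liaoning Technical University, Huludao, Liaoning 125105)
  • Received: 2009-05-25 Revised: 2009-09-14 Online: 2010-07-25 Published: 2010-07-28
  • About the authors: SUN Wei (born 1983), female, from Caoxian, Shandong, master's student, research interest: data mining; XING Changzheng, professor, research interests: database theory and applications, data mining.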

An Improved Copy Detection Algorithm for the Chinese Documents

SUN Wei, XING Changzheng

  1. (Liaoning Technical University, Huludao 125105, China)
  • Received: 2009-05-25 Revised: 2009-09-14 Online: 2010-07-25 Published: 2010-07-28

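Abstract:

Text copy detection is the task of determining whether the content of a document has been plagiarized or copied from one or more other documents. Many algorithms exist in this field. Detection based on sentence similarity combines the strengths of string-comparison methods and word-frequency-statistics methods: it captures the global features of a document while also taking its structural information into account, which makes it a strong approach. Building on that algorithm, this paper improves the similarity computation and proposes a new sentence-similarity-based copy detection algorithm for Chinese documents. The algorithm takes the characteristics of Chinese documents fully into account, uses the sentence as the feature unit of a document, and removes the need to set thresholds manually, which improves detection accuracy. Experiments show that the algorithm is feasible in terms of both efficiency and accuracy.

Keywords: Chinese documents, copy detection, Chinese word segmentation, sentence similarity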

Abstract:

Document copy detection judges whether a document has been copied from one or more other documents. There are many algorithms in this domain. The algorithm based on sentence similarity is a good one: it captures the document as a whole while also paying attention to the structure of the document. In this paper, the authors improve the similarity computation of that algorithm and propose a new algorithm aimed at checking Chinese documents. The algorithm uses the sentence as the basic unit of a document and improves on the earlier methods. It removes the need to set the threshold manually and improves detection accuracy, and the experimental results show that it is feasible.

Key words: Chinese document; copy detection; Chinese word segmentation; sentence similarity
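
As a rough illustration of what sentence-level copy detection involves, the sketch below splits two texts into sentences, scores every sentence pair, and reports the fraction of sentences in the suspect text that closely match some sentence in the source. It is not the authors' algorithm: the abstract gives neither the similarity formula nor the way the manual threshold is eliminated. Character bigrams (used here in place of Chinese word segmentation), cosine similarity, the fixed `threshold` parameter, and all function names are assumptions made for illustration only.

```python
import re
from collections import Counter
from math import sqrt

# Sentence-ending punctuation (Chinese and ASCII) plus line breaks.
_SENT_END = re.compile(r"[。！？!?\n]+")


def split_sentences(text):
    """Split text into non-empty sentences on sentence-ending punctuation."""
    return [s.strip() for s in _SENT_END.split(text) if s.strip()]


def bigrams(sentence):
    """Count character bigrams; an assumed stand-in for word segmentation."""
    chars = [c for c in sentence if not c.isspace()]
    if len(chars) < 2:
        return Counter(chars)
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def copy_ratio(suspect, source, threshold=0.8):
    """Fraction of suspect sentences whose best match in source scores >= threshold.

    The fixed threshold is an assumption of this sketch; the paper reports
    removing the need to set it by hand, but the abstract does not say how.
    """
    src = [bigrams(s) for s in split_sentences(source)]
    sus = [bigrams(s) for s in split_sentences(suspect)]
    if not src or not sus:
        return 0.0
    hits = sum(1 for v in sus if max(cosine(v, w) for w in src) >= threshold)
    return hits / len(sus)
```

Calling `copy_ratio(suspect, source)` returns a value between 0 and 1. Working at sentence granularity keeps some of the structural information that a whole-document word-frequency comparison would discard, which is the trade-off the abstract describes.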