An Improved Copy Detection Algorithm for Chinese Documents

SUN Wei, XING Changzheng

doi:10.3969/j.issn.1007130X.2010.

Computer Engineering & Science, 2010, Vol. 32, Issue 8: 101-103


Article


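(Liaoning Technical University, Huludao, Liaoning 125105, China)

SUN Wei (born 1983), female, from Caoxian, Shandong, is a master's student whose research interest is data mining; XING Changzheng is a professor whose research interests are database theory and applications, and data mining.

Received: 2009-05-25

Revised: 2009-09-14

Online published: 2010-07-28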


An Improved Copy Detection Algorithm for Chinese Documents


(Liaoning Technical University, Huludao 125105, China)


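Abstract

Document copy detection determines whether the content of a document has been plagiarized or copied from one or more other documents. Many algorithms exist in this field. Detection algorithms based on sentence similarity combine the strengths of string-comparison methods and word-frequency-statistics methods: they capture the global features of a document while also taking its structural information into account. Building on such an algorithm, this paper improves the similarity computation and proposes a new sentence-similarity-based copy detection algorithm for Chinese documents. The algorithm takes the characteristics of Chinese documents fully into account, uses the sentence as the feature unit of a document, and removes the need to set a threshold manually, which improves detection accuracy. Experiments show that the algorithm is feasible in terms of both efficiency and accuracy.

Keywords: Chinese documents; copy detection; Chinese word segmentation; sentence similarity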

Cite this article

SUN Wei, XING Changzheng. An Improved Copy Detection Algorithm for Chinese Documents[J]. Computer Engineering & Science, 2010, 32(8): 101-103. DOI: 10.3969/j.issn.1007130X.2010.

Abstract

Document copy detection judges whether a document has been copied from one or more other documents. There are many algorithms in this domain. The algorithm based on sentence similarity is a good one: it not only captures the document as a whole but also pays attention to the document's structure. In this paper, the authors improve the similarity computation of that algorithm and propose a new algorithm aimed at Chinese documents. The algorithm uses the sentence as the basic unit of a document and improves on earlier methods. It removes the need to set a threshold manually and improves detection accuracy, and experimental results show that it is feasible.

Key words: Chinese document; copy detection; Chinese word segmentation; sentence similarity
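The abstract gives no formulas, so the following is only a minimal sketch of the general idea of sentence-level copy detection, not the authors' algorithm. It assumes character bigrams as a stand-in for Chinese word segmentation, cosine similarity between sentences, and a document score equal to the mean best-match similarity. All function names (`split_sentences`, `bigrams`, `cosine`, `copy_score`) are hypothetical, and the paper's own method for avoiding a manually set threshold is not reproduced here.

```python
import re
from collections import Counter
from math import sqrt

# Assumption: sentences end at common Chinese or Western terminators.
SENTENCE_END = re.compile(r"[。！？!?；;]+")


def split_sentences(text):
    """Split text into non-empty sentences (the feature units)."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def bigrams(sentence):
    """Count character bigrams; a stand-in for real word segmentation."""
    chars = [c for c in sentence if not c.isspace()]
    if len(chars) < 2:
        return Counter(chars)
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def copy_score(suspect, source):
    """Mean, over suspect sentences, of the best similarity to any source sentence.

    Hypothetical scoring rule chosen for illustration; it is not the
    similarity formula proposed in the paper.
    """
    src = [bigrams(s) for s in split_sentences(source)]
    sus = [bigrams(s) for s in split_sentences(suspect)]
    if not src or not sus:
        return 0.0
    best = [max(cosine(s, t) for t in src) for s in sus]
    return sum(best) / len(best)
```

Calling `copy_score(suspect_text, source_text)` returns a value between 0 and 1, where higher values mean more sentences in the suspect text closely match some sentence in the source. A real system would replace `bigrams` with a proper Chinese word segmenter.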
