An Improved Copy Detection Algorithm for Chinese Documents
Received: 2009-05-25
Revised: 2009-09-14
Published online: 2010-07-28
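Text copy detection determines whether the content of a document has been plagiarized or copied from one or more other documents. Many algorithms exist in this field. Detection based on sentence similarity combines the strengths of string-comparison methods and word-frequency-statistics methods: it captures a document's global features while also taking its structural information into account, which makes it a good algorithm. Building on that algorithm, this paper improves the similarity computation and proposes a new sentence-similarity-based copy detection algorithm for Chinese documents. The algorithm takes the characteristics of Chinese documents fully into account, selects the sentence as the document's feature unit, removes the need for a manually set threshold, and improves detection accuracy. Experiments show that the algorithm is feasible in terms of both efficiency and accuracy.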
孙伟 (Sun Wei), 邢长征 (Xing Changzheng). An Improved Copy Detection Algorithm for Chinese Documents[J]. Computer Engineering & Science (计算机工程与科学), 2010, 32(8): 101-103. DOI: 10.3969/j.issn.1007130X.2010.
Document copy detection judges whether a document has been copied from one or more other documents. There are many algorithms in this domain. The algorithm based on sentence similarity is a good one: it considers the document as a whole while also paying attention to its structure. In this paper, the authors improve the similarity computation of that algorithm and present a new algorithm aimed at Chinese documents. The algorithm uses the sentence as the basic unit of a document and improves on earlier methods. It removes the need to set a threshold manually and improves detection accuracy, and the experimental results show that it is feasible.
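The abstract does not give the authors' similarity formula, so the following is only a minimal sketch of the general idea of sentence-level copy detection, not the paper's algorithm. Everything in it is an assumption made for illustration: the function names, the use of character-bigram Jaccard overlap as the sentence similarity, and the averaging of best-match scores as a threshold-free document score.

```python
import re


def split_sentences(text):
    # Split on common Chinese and Western sentence-ending punctuation.
    parts = re.split(r"[。！？；!?;]+", text)
    return [p.strip() for p in parts if p.strip()]


def bigrams(sentence):
    # Character bigrams sidestep the need for a Chinese word segmenter
    # (an assumption of this sketch, not something the paper states).
    chars = re.sub(r"\s+", "", sentence)
    if len(chars) < 2:
        return {chars} if chars else set()
    return {chars[i:i + 2] for i in range(len(chars) - 1)}


def sentence_similarity(a, b):
    # Jaccard overlap of the two bigram sets, in [0, 1].
    ga, gb = bigrams(a), bigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def copy_score(suspect, source):
    # For each sentence in the suspect document, take its best match in
    # the source document, then average. No cut-off value is needed.
    s_sents = split_sentences(suspect)
    r_sents = split_sentences(source)
    if not s_sents or not r_sents:
        return 0.0
    best = [max(sentence_similarity(s, r) for r in r_sents) for s in s_sents]
    return sum(best) / len(best)
```

Averaging the best-match scores is one simple way to avoid a hand-tuned cut-off. The paper may handle the threshold problem quite differently.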