• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)


A mislabeled data detection method based on sparse reconstruction weights
(基于稀疏重构权的错误标注数据检测方法)

WU Jing-sheng (吴敬生), WANG Jing (王靖), DU Ji-xiang (杜吉祥)

  1. (School of Computer Science and Technology, Huaqiao University, Xiamen, Fujian 361021, China)
  • Received: 2016-01-21 Revised: 2016-05-17 Online: 2017-11-25 Published: 2017-11-25
  • Funding:

    National Natural Science Foundation of China (61370006); Natural Science Foundation of Fujian Province (2014J01237); Science and Technology Project of the Fujian Provincial Department of Education (JA12006); Program for New Century Excellent Talents in Fujian Province Universities (2012FJ-NCET-ZR01); Huaqiao University Science and Technology Innovation Program for Young and Middle-aged Teachers (ZQN-PY116); Huaqiao University Cultivation Program for Postgraduate Research and Innovation Ability (1400214005)

Abstract:

The accuracy of data classification depends on the quality and quantity of labeled data, and it degrades considerably when training data are mislabeled. To address this, we propose a method for detecting mislabeled data based on sparse reconstruction weights. First, the k-nearest-neighbor method is used to find the neighbors of each point in a data set that contains mislabeled data. Then, the local sparse reconstruction weights of each labeled point are computed by solving a least-squares model with an L1-norm term, and these weights are used to compute a confidence level for each label. Finally, the mislabeled data are detected adaptively by locating the position of maximum curvature on the confidence curve. Experiments on real data demonstrate the effectiveness of the proposed algorithm.

Key words: sparse reconstruction weight, mislabeled, confidence level, detection
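
The three steps named in the abstract can be sketched in Python with NumPy. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the formulas, so several pieces are assumptions. The L1-regularized least-squares problem is solved here with a plain ISTA (proximal gradient) loop; the confidence of a label is assumed to be the share of absolute reconstruction weight carried by neighbors with the same label; and the detection rule is assumed to flag every point whose confidence does not exceed the value at the maximum-curvature position of the sorted confidence curve. The function names and the parameters `k`, `lam`, and `n_iter` are hypothetical.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row of X (brute force)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def sparse_weights(x, atoms, lam=0.1, n_iter=500):
    """ISTA for min_w 0.5*||x - atoms.T @ w||^2 + lam*||w||_1.

    atoms has shape (k, d): one neighbor per row.
    """
    lipschitz = np.linalg.norm(atoms, 2) ** 2  # largest eigenvalue of atoms @ atoms.T
    step = 1.0 / max(lipschitz, 1e-12)
    w = np.zeros(atoms.shape[0])
    for _ in range(n_iter):
        grad = atoms @ (atoms.T @ w - x)
        z = w - step * grad
        # Soft-thresholding is the proximal operator of the L1 penalty.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w


def label_confidence(X, y, k=5, lam=0.1):
    """ASSUMED confidence: weight mass on same-label neighbors / total mass."""
    conf = np.empty(len(X))
    for i, idx in enumerate(knn_indices(X, k)):
        w = np.abs(sparse_weights(X[i], X[idx], lam))
        same = y[idx] == y[i]
        total = w.sum()
        # Fall back to plain neighbor agreement if all weights are zero.
        conf[i] = w[same].sum() / total if total > 0 else same.mean()
    return np.clip(conf, 0.0, 1.0)


def detect_mislabeled(conf):
    """ASSUMED rule: threshold at the max-curvature point of the sorted curve."""
    curve = np.sort(conf)
    h = 1.0 / (len(curve) - 1)  # sample positions normalized to [0, 1]
    d1 = np.gradient(curve, h)
    d2 = np.gradient(d1, h)
    kappa = np.abs(d2) / (1.0 + d1 ** 2) ** 1.5  # curvature of a plane curve
    threshold = curve[int(np.argmax(kappa))]
    return np.flatnonzero(conf <= threshold)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([5, 5], 0.3, (20, 2)),
                   rng.normal([-5, 5], 0.3, (20, 2))])
    y = np.array([0] * 20 + [1] * 20)
    y[3], y[25] = 1, 0  # inject two label errors
    print(detect_mislabeled(label_confidence(X, y)))
```

The finite-difference curvature is sensitive to noise in the confidence curve, so smoothing the sorted curve first may be worthwhile on real data. A dedicated Lasso solver could also replace the ISTA loop without changing the rest of the sketch.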