• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science

• Articles

A data quality assessment method of Web articles based on PAC-Bayes theory

TANG Li, HE Li

  1. (Department of Information Science and Technology, School of Science and Technology,
    Tianjin University of Finance and Economics, Tianjin 300222, China)

  • Received: 2015-05-15 Revised: 2015-11-26 Online: 2017-03-25 Published: 2017-03-25
  • Supported by:

    Natural Science Foundation of Tianjin (15JCYBJC16000); General Project of Humanities and Social Sciences Research, Ministry of Education (14YJA630025); Tianjin Social Science Foundation (TJYY15-017); National Natural Science Foundation of China (61502331)

Abstract:

To better assess the data quality of Web articles, we propose an assessment index system and an assessment method based on PAC-Bayes theory. PAC-Bayes theory combines Probably Approximately Correct (PAC) learning theory with Bayes' theorem and, by making full use of prior information about the samples, derives the tightest generalization risk bounds, which measure the generalization performance of a learning algorithm. We first review the state of research on data quality assessment of articles and introduce the PAC-Bayes framework and its application to support vector machines (SVMs). We then propose a PAC-Bayes-based data quality assessment method for Web articles (DQAPB), which applies the SVM algorithm and its PAC-Bayes bound to the quality evaluation of Web articles, and we construct a corresponding quality assessment index system. Finally, experiments on Wikipedia articles show that the proposed method is simple and fast, with strong stability and robustness.

Key words: PAC-Bayes bound, support vector machine (SVM), generalization capability, data quality assessment
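
The abstract does not state which form of the PAC-Bayes bound DQAPB uses, so the following is only a minimal sketch of one common choice: the Langford-style PAC-Bayes margin bound for a bias-free linear classifier, with a Gaussian prior N(0, I) and a Gaussian posterior centred at mu times the unit weight vector. The function names, the scaling parameter `mu`, the confidence level `delta`, and the toy margins are all illustrative assumptions, not taken from the paper. The bound applies to the stochastic (Gibbs) classifier drawn from the posterior.

```python
import math


def gaussian_tail(t):
    """P(Z >= t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def kl_inverse_upper(q, bound):
    """Largest p >= q with binary_kl(q, p) <= bound, found by bisection."""
    lo, hi = q, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binary_kl(q, mid) > bound:
            hi = mid
        else:
            lo = mid
    return lo


def normalized_margins(w, X, y):
    """y * <w, x> / (||w|| * ||x||) for each example; labels are +1 or -1."""
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    margins = []
    for x, label in zip(X, y):
        x_norm = math.sqrt(sum(xi * xi for xi in x))
        dot = sum(wi * xi for wi, xi in zip(w, x))
        margins.append(label * dot / (w_norm * x_norm))
    return margins


def gibbs_train_error(margins, mu):
    """Empirical error of the stochastic classifier with posterior N(mu * w, I)."""
    return sum(gaussian_tail(mu * g) for g in margins) / len(margins)


def pac_bayes_bound(margins, mu, delta=0.05):
    """Upper bound on the true Gibbs error, holding with probability >= 1 - delta."""
    m = len(margins)
    # KL(posterior || prior) = mu^2 / 2 for the two Gaussians assumed above.
    complexity = (mu * mu / 2.0 + math.log((m + 1) / delta)) / m
    return kl_inverse_upper(gibbs_train_error(margins, mu), complexity)


def best_bound(margins, mus, delta=0.05):
    """The bound holds for all mu simultaneously, so the smallest one can be taken."""
    return min(pac_bayes_bound(margins, mu, delta) for mu in mus)


if __name__ == "__main__":
    # Toy margins standing in for those of a trained linear SVM (hypothetical data).
    toy_margins = [0.5] * 900 + [0.1] * 80 + [-0.2] * 20
    print(best_bound(toy_margins, mus=[1.0, 2.0, 5.0, 10.0]))
```

In practice the margins would come from an SVM trained on the article quality features, and a smaller bound suggests that the learned quality classifier generalizes better to unseen articles.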