• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science


A data quality assessment method for
Web articles based on PAC-Bayes theory

TANG Li, HE Li

  1. (Department of Information Science and Technology, School of Science and Technology,
    Tianjin University of Finance and Economics, Tianjin 300222, China)
     
  • Received: 2015-05-15 Revised: 2015-11-26 Online: 2017-03-25 Published: 2017-03-25

Abstract:

We propose an assessment index system and a method, both based on PAC-Bayes theory, for assessing the data quality of Web articles. PAC-Bayes theory combines the Probably Approximately Correct (PAC) learning framework with the Bayesian paradigm; by making full use of prior information about the samples, it derives generalization bounds that are among the tightest available for assessing the generalization capability of classifiers. We first review the state of research on data quality assessment of articles in detail, and then introduce the theoretical framework of PAC-Bayes theory and its application to support vector machines (SVM). Building on this, we propose DQAPB, a method for data quality assessment of Web articles based on PAC-Bayes theory, which applies the SVM algorithm and its PAC-Bayes bound to the assessment task. We also establish a quality assessment index system for Web articles based on PAC-Bayes theory. Experiments on Wikipedia documents show that the proposed method is simple and fast, with strong stability and robustness.

Key words: PAC-Bayes bound, support vector machine (SVM), generalization capability, data quality assessment