• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2021, Vol. 43 ›› Issue (09): 1546-1557.

• High Performance Computing •

Truth discovery based on neural network encoding

CAO Jian-jun1, CHANG Chen1,2, WENG Nian-feng1, TAO Jia-qing1,3, JIANG Chun1

  (1. The Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007;
   2. Institute of Command and Control Engineering, Army Engineering University, Nanjing 210007;
   3. Department of Industrial Engineering, Nanjing University of Technology, Nanjing 210009, China)

  • Received: 2020-09-10 Revised: 2020-11-20 Accepted: 2021-09-25 Online: 2021-09-25 Published: 2021-09-26
  • Supported by:
    National Natural Science Foundation of China (61371196); China Postdoctoral Science Foundation (20090461425, 201003797); National Science and Technology Major Project (2015ZX01040201-003)

Abstract: Because the Internet is open and multi-sourced, the data provided by different platforms vary in quality, and descriptions of the same entity from multiple sources may conflict with one another. Truth discovery is one of the important techniques for resolving such semantic conflicts and improving data quality. Traditional truth discovery algorithms usually assume that the relationship between source reliability and claim credibility can be expressed by a simple function, and they design iterative rules or probabilistic models to find trustworthy claims and sources. However, manually defined conditions often fail to reflect the true underlying distribution of the data, which leads to unsatisfactory truth discovery results. To address this problem, a truth discovery method based on neural network encoding, TDNNE, is proposed. Firstly, a double-loss deep neural network is constructed from the "source-source" and "source-claim" relationships and the assumptions of truth discovery. Secondly, the network embeds the sources and claims into a low-dimensional space, where the embeddings represent source reliability and claim credibility respectively, so that reliable sources and trustworthy claims lie close to each other (and, likewise, unreliable sources and untrustworthy claims lie close to each other). Finally, truth discovery is performed on the embedding space. Compared with traditional methods, TDNNE does not require iterative rules or data distributions to be defined manually; instead, it uses the neural network to automatically learn the complex dependencies between sources and claims. Experimental results on real-world data show that the proposed method improves precision by about 2%~25% over iteration-based methods such as Accu, by about 2%~4% over probabilistic graphical model based methods such as 3-Estimates, by about 2%~5% over the optimization-based method CRH, and by about 1%~2% over the neural network based method FFMN.

Key words: data quality, data cleaning, conflict resolution, truth discovery, neural network
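
For readers unfamiliar with the traditional approach that the abstract contrasts against, the following is a minimal sketch of a generic iterative weighted-voting scheme, in which source reliability and claim credibility are tied together by a simple hand-written rule. It is not Accu, 3-Estimates, CRH, FFMN, or TDNNE. The function name, the toy data, the starting reliability of 0.8, and the add-one smoothing are all assumptions made only for illustration.

```python
from collections import defaultdict


def iterative_truth_discovery(claims, n_iter=20, init_reliability=0.8):
    """Generic weighted-voting truth discovery (illustrative sketch only).

    claims: list of (source, obj, value) triples.
    Returns (truths, reliability), both plain dicts.
    """
    sources = {s for s, _, _ in claims}
    # Assumption: every source starts with the same arbitrary reliability.
    reliability = {s: init_reliability for s in sources}
    truths = {}

    for _ in range(n_iter):
        # Step 1: a value's credibility is the summed reliability of the
        # sources that claim it; the best-scoring value per object wins.
        scores = defaultdict(lambda: defaultdict(float))
        for s, o, v in claims:
            scores[o][v] += reliability[s]
        new_truths = {o: max(vals, key=vals.get) for o, vals in scores.items()}

        # Step 2: a source's reliability is the (add-one smoothed) fraction
        # of its claims that agree with the current truths.
        hits = defaultdict(int)
        total = defaultdict(int)
        for s, o, v in claims:
            total[s] += 1
            hits[s] += int(new_truths[o] == v)
        reliability = {s: (hits[s] + 1) / (total[s] + 2) for s in sources}

        # Stop once the selected truths no longer change.
        if new_truths == truths:
            break
        truths = new_truths

    return truths, reliability


# Toy data: s1 and s3 disagree on every object; s2 sides with s1 on o1 and o2.
toy_claims = [
    ("s1", "o1", "a"), ("s2", "o1", "a"), ("s3", "o1", "b"),
    ("s1", "o2", "c"), ("s2", "o2", "c"), ("s3", "o2", "d"),
    ("s3", "o3", "f"), ("s1", "o3", "e"),
]
truths, reliability = iterative_truth_discovery(toy_claims)
print(truths)
print(reliability)
```

On the toy data, object o3 starts as a tie between the two sources that report it. After one round, s1 has agreed with the majority on o1 and o2 and so outweighs s3, which changes the value selected for o3. The update rule here is exactly the kind of manually defined relationship that the abstract argues a learned embedding can replace.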