Truth discovery based on neural network encoding

Computer Engineering & Science, 2021, Vol. 43, Issue 9: 1546-1557.

曹建军 (1), 常宸 (1,2), 翁年凤 (1), 陶嘉庆 (1,3), 江春 (1)

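(1. The Sixty-third Research Institute, National University of Defense Technology, Nanjing, Jiangsu 210007; 2. Institute of Command and Control Engineering, Army Engineering University, Nanjing, Jiangsu 210007;

3. Department of Industrial Engineering, Nanjing University of Technology, Nanjing, Jiangsu 210009)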

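Received: 2020-09-10 Revised: 2020-11-20 Online: 2021-09-25 Published: 2021-09-26
Funding:
National Natural Science Foundation of China (61371196); China Postdoctoral Science Foundation (20090461425, 201003797); National Science and Technology Major Project (2015ZX01040201-003)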

Truth discovery based on neural network encoding

CAO Jian-jun (1), CHANG Chen (1,2), WENG Nian-feng (1), TAO Jia-qing (1,3), JIANG Chun (1)

(1. The Sixty-third Research Institute, National University of Defense Technology, Nanjing 210007;

2. Institute of Command and Control Engineering, Army Engineering University, Nanjing 210007;

3. Department of Industrial Engineering, Nanjing University of Technology, Nanjing 210009, China)

Received: 2020-09-10 Revised: 2020-11-20 Online: 2021-09-25 Published: 2021-09-26

Abstract

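Abstract (Chinese version): Because the Internet is open and multi-sourced, the data provided by different Internet platforms vary in quality, and multiple data sources may give conflicting descriptions of the same entity. Truth discovery is one of the important technical means of resolving semantic conflicts and improving data quality. Traditional truth discovery algorithms usually assume that the relationship between source reliability and claim credibility can be expressed by a simple function, and they design iterative rules or probabilistic models to discover truths. However, manually defined conditions usually fail to reflect the true underlying distribution of the data, which leads to unsatisfactory truth discovery results. To address this problem, a truth discovery method based on neural network encoding, TDNNE, is proposed. First, a double-loss deep neural network is constructed from the “source-source” and “source-claim” relationships and the assumptions of truth discovery. Then, the network embeds sources and claims into a high-dimensional space, representing source reliability and claim credibility respectively, so that reliable sources and credible claims lie close to each other (and, likewise, unreliable sources and non-credible claims lie close to each other). Finally, truth discovery is performed in the embedding space. Compared with traditional methods, TDNNE does not require manually defined iterative rules or data distributions; instead, it uses the neural network to automatically learn the complex dependencies between sources and claims. Experimental results on real datasets show that the accuracy of the method is about 2%~25% higher than that of iteration-based methods such as Accu, about 2%~4% higher than that of probabilistic-graphical-model-based methods such as 3-Estimates, about 2%~5% higher than that of the optimization-based CRH method, and about 1%~2% higher than that of the neural-network-based FFMN method.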

Keywords (Chinese version): data quality, data cleaning, conflict resolution, truth discovery, neural network

Abstract: Due to the openness and diversity of the Internet, different platforms provide information of varying quality, and descriptions of the same object may conflict with each other. Truth discovery is one of the important technical means of resolving semantic conflicts and improving data quality. Traditional truth discovery methods usually assume that the relationship between source reliability and claim credibility can be represented by a simple function, and they design iterative rules or probabilistic models to find trustworthy claims and sources. However, manually defined factors often fail to reflect the real underlying distribution of the data, which leads to unsatisfactory truth discovery results. To solve this problem, a truth discovery method based on neural network encoding is proposed. Firstly, the method constructs a double-loss deep neural network from the “source-source” and “source-claim” relationships. Secondly, it embeds the sources and claims into a low-dimensional space, where the embeddings indicate source reliability and claim credibility. After optimization, reliable sources and trustworthy claims are close to each other in the embedding space, as are unreliable sources and untrustworthy claims. Finally, truth discovery is performed based on the embedding space. Compared with traditional methods, the proposed method does not need manually defined iterative rules or data distributions before truth discovery; it uses the neural network to automatically learn the complex relationships among sources and claims and then embeds them into a low-dimensional space. Experimental results on real-world data show that the proposed model increases precision by 2%~25% compared with iteration-based methods such as Accu, by 2%~4% compared with probabilistic-graphical-model-based methods such as 3-Estimates, by 2%~5% compared with optimization-based methods such as CRH, and by 1%~2% compared with the neural-network-based method FFMN.

Key words: data quality, data cleaning, conflict resolution, truth discovery, neural network
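
For orientation, the kind of traditional iterative scheme that the abstract contrasts with can be sketched as follows. This is not the proposed method and not the Accu algorithm itself: it is a minimal illustration of the general idea that claim credibility and source reliability are tied by a simple hand-written rule and updated in turn. The function name `iterative_truth_discovery`, the weighted-vote update, the smoothing constants, and the toy observations are all assumptions made for this sketch.

```python
from collections import defaultdict


def iterative_truth_discovery(observations, n_iters=10):
    """Toy iterative truth discovery (illustrative assumption, not TDNNE).

    observations: list of (source, object, claimed_value) triples.
    Returns (truths, trust): the chosen value per object and a
    reliability score per source.
    """
    sources = sorted({s for s, _, _ in observations})
    trust = {s: 0.5 for s in sources}  # start with equal reliability
    truths = {}
    for _ in range(n_iters):
        # Rule 1: a claim's credibility is the summed trust of its supporters.
        score = defaultdict(float)
        for s, obj, val in observations:
            score[(obj, val)] += trust[s]
        # Per object, keep the claim with the highest credibility.
        best = {}
        for (obj, val), sc in score.items():
            if obj not in best or sc > best[obj][1]:
                best[obj] = (val, sc)
        truths = {obj: val for obj, (val, _) in best.items()}
        # Rule 2: a source's reliability is the (smoothed) fraction of its
        # claims that agree with the current truths.
        hits, total = defaultdict(int), defaultdict(int)
        for s, obj, val in observations:
            total[s] += 1
            hits[s] += int(truths[obj] == val)
        trust = {s: (hits[s] + 1) / (total[s] + 2) for s in sources}
    return truths, trust


# Hypothetical toy data: sources A and B agree on o1 and o2, C disagrees;
# on o3 only A and C make claims, so plain majority voting would tie.
observations = [
    ("A", "o1", "x"), ("A", "o2", "p"), ("A", "o3", "m"),
    ("B", "o1", "x"), ("B", "o2", "p"),
    ("C", "o1", "y"), ("C", "o2", "q"), ("C", "o3", "n"),
]
truths, trust = iterative_truth_discovery(observations)
print(truths)  # {'o1': 'x', 'o2': 'p', 'o3': 'm'}
```

In this toy run the conflict on `o3` is resolved in favour of source A because A agreed with the majority elsewhere. Both update rules are fixed by hand, which is the limitation the abstract points to when it argues for learning the source-claim relationship with a neural network instead.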

曹建军, 常宸, 翁年凤, 陶嘉庆, 江春. 基于神经网络编码的真值发现[J]. 计算机工程与科学, 2021, 43(9): 1546-1557.

CAO Jian-jun, CHANG Chen, WENG Nian-feng, TAO Jia-qing, JIANG Chun. Truth discovery based on neural network encoding[J]. Computer Engineering & Science, 2021, 43(9): 1546-1557.
