
Computer Engineering & Science (计算机工程与科学) ›› 2021, Vol. 43 ›› Issue (04): 670-680.

• Software Engineering

Duplicate bug report detection by combining distributed representations of documents

ZENG Jie (曾杰), BEN Ke-rong (贲可荣), ZHANG Xian (张献), XU Yong-shi (徐永士)

  1. (College of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China)
  • Received: 2020-01-06 Revised: 2020-06-18 Accepted: 2021-04-25 Online: 2021-04-25 Published: 2021-04-21

Abstract: Duplicate bug report detection avoids the repeated assignment and repair of multiple bug reports that describe the same bug, and thereby reduces the cost of software maintenance. To further improve detection accuracy, this paper proposes a duplicate bug report detection method that incorporates distributed representations of documents. First, a Doc2Vec model is trained on a large-scale bug report database and used to extract distributed representations of bug reports, so that reports of varying length are encoded as fixed-length dense vectors. Next, the similarity between bug reports is computed by comparing these vectors; this similarity serves as a new feature, which is fused with other features commonly used in duplicate bug report detection, and a machine learning algorithm is used to train a binary classification model. Experimental results on public Bugzilla duplicate bug report datasets show that, compared with the representative method D_TS, the proposed method improves the F1 score by 2% on average, which demonstrates the effectiveness of the new feature.


Key words: duplicate bug report, distributed representations of documents, Doc2Vec model, machine learning algorithm
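
The feature-fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that the dense vectors are compared, so cosine similarity is an assumption here, and the vectors and "traditional" feature values are hypothetical placeholders rather than Doc2Vec output or features from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length dense vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def fuse_features(
    vec_a: Sequence[float],
    vec_b: Sequence[float],
    traditional_features: Sequence[float],
) -> List[float]:
    """Append the embedding-similarity feature to the other pairwise features."""
    return list(traditional_features) + [cosine_similarity(vec_a, vec_b)]


# Hypothetical fixed-length embeddings standing in for two bug reports.
report_a = [0.2, -0.1, 0.7, 0.4]
report_b = [0.1, -0.2, 0.6, 0.5]

# Hypothetical values for two other pairwise features of the same report pair.
traditional = [1.0, 0.0]

# One row of input for a binary (duplicate / not duplicate) classifier.
pair_features = fuse_features(report_a, report_b, traditional)
print(pair_features)
```

Each report pair yields one such feature row; a binary classifier would then be trained on these rows with duplicate / non-duplicate labels.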
