• A journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (06): 1097-1105.

• Artificial Intelligence and Data Mining •


A sound event localization and detection algorithm based on feature fusion and Transformer model

PU Zi-jun, ZHANG Shou-ming

  1. (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China)
  • Received: 2021-08-02  Revised: 2021-12-13  Accepted: 2023-06-25  Online: 2023-06-25  Published: 2023-06-16



Abstract: To address the problem of multi-channel environmental sound detection, a feature fusion network model, TBCF-MTNN, which introduces the Transformer structure, is proposed. The network takes the log-Mel spectrogram and the generalized cross-correlation (GCC) spectrum as inputs. First, local spectral features and temporal context features are extracted by a CNN and a GRU, respectively. The two feature maps are then fused through a Cross-stitch module, which effectively solves the problem that multi-feature information cannot be shared in traditional networks. Next, the fused feature map is fed into a Transformer for further feature extraction. Finally, the classification and localization results are output through a fully connected layer. Experiments on the TAU-NIGENS 2020 dataset show that the proposed TBCF-MTNN network reduces the classification error rate to 0.26 in the sound event detection task and, compared with the Baseline, reduces the localization error to 4.7° in the sound source localization task. Comparisons with the Baseline, FPN, EIN, and other models show that the proposed network achieves better recognition and detection performance.
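The Cross-stitch fusion step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the fixed alpha matrix, and all names below are assumptions for illustration; in the actual model the 2×2 mixing matrix would be a learnable parameter trained jointly with the two branches.

```python
import numpy as np

def cross_stitch(feat_a, feat_b, alpha):
    """Cross-stitch unit: mix two same-shaped feature maps (e.g. CNN and
    GRU branch outputs) with a 2x2 matrix alpha, so each branch receives
    a weighted share of the other's features."""
    out_a = alpha[0, 0] * feat_a + alpha[0, 1] * feat_b
    out_b = alpha[1, 0] * feat_a + alpha[1, 1] * feat_b
    return out_a, out_b

# Toy example: two 2x3 "feature maps" mixed with a mostly-identity alpha,
# so each output is dominated by its own branch but shares information.
a = np.ones((2, 3))      # stands in for the CNN-branch feature map
b = np.zeros((2, 3))     # stands in for the GRU-branch feature map
alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
mixed_a, mixed_b = cross_stitch(a, b, alpha)  # mixed_a == 0.9, mixed_b == 0.1 everywhere
```

Setting alpha near the identity lets training start from nearly independent branches and learn how much cross-task sharing helps.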

Key words: sound event localization and detection, deep learning, Transformer model, Cross-stitch, feature fusion
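The generalized cross-correlation input feature named in the abstract is commonly computed with the phase transform (GCC-PHAT); the paper does not give code, so the following NumPy sketch, with names and parameters of my own choosing, only illustrates the idea: whiten the cross-spectrum of two microphone channels so that the remaining phase encodes the inter-channel time delay used for localization.

```python
import numpy as np

def gcc_phat(sig_a, sig_b, n_fft=None):
    """GCC-PHAT between two channels: normalize the cross-power spectrum
    to unit magnitude (phase transform) and return the correlation
    sequence reordered so that index len//2 is zero lag."""
    n = n_fft or max(len(sig_a), len(sig_b))
    spec_a = np.fft.rfft(sig_a, n=n)
    spec_b = np.fft.rfft(sig_b, n=n)
    cross = np.conj(spec_a) * spec_b      # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    # Note: without zero-padding this correlation is circular, which is
    # exact for the circular-shift toy example below.
    return np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))

# Toy example: sig_b is sig_a circularly delayed by 5 samples, so the
# GCC-PHAT peak appears 5 samples right of the center of the sequence.
rng = np.random.default_rng(0)
sig_a = rng.standard_normal(256)
sig_b = np.roll(sig_a, 5)
cc = gcc_phat(sig_a, sig_b)
delay = int(np.argmax(cc)) - len(cc) // 2   # -> 5
```

In a SELD front end such per-microphone-pair correlation sequences are stacked alongside the log-Mel spectrogram to form the multi-channel network input.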