• A journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (06): 1097-1105.

• Artificial Intelligence and Data Mining •


A sound event localization and detection algorithm based on feature fusion and Transformer model

PU Zi-jun, ZHANG Shou-ming

  1. (Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China)
  • Received: 2021-08-02  Revised: 2021-12-13  Accepted: 2023-06-25  Online: 2023-06-25  Published: 2023-06-16



Abstract: To address the problem of multi-channel environmental sound detection, a feature fusion network model, TBCF-MTNN, which introduces the Transformer structure, is proposed. The network takes the log-Mel spectrogram and the generalized cross-correlation (GCC) spectrum as inputs. First, local spectral features and temporal context features are extracted by a CNN and a GRU, respectively. The two feature maps are then fused through a Cross-stitch module, which effectively solves the problem that multi-feature information cannot be shared in traditional networks. Next, the fused feature map is fed into a Transformer for further feature extraction. Finally, the classification and localization results are output through a fully connected layer. Experiments on the TAU-NIGENS 2020 dataset show that the proposed TBCF-MTNN network reduces the classification error rate to 0.26 in the sound event detection task and, compared with the Baseline, reduces the localization error to 4.7° in the sound source localization task. Comparisons with the Baseline, FPN, EIN, and other models show that the proposed network achieves better recognition and detection performance.
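The Cross-stitch fusion step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the fixed alpha matrix, and all names below are assumptions for illustration; in the actual model the 2×2 mixing matrix would be a learnable parameter trained jointly with the two branches.

```python
import numpy as np

def cross_stitch(feat_a, feat_b, alpha):
    """Cross-stitch unit: mix two same-shaped feature maps (e.g. CNN and
    GRU branch outputs) with a 2x2 matrix alpha, so each branch receives
    a weighted share of the other's features."""
    out_a = alpha[0, 0] * feat_a + alpha[0, 1] * feat_b
    out_b = alpha[1, 0] * feat_a + alpha[1, 1] * feat_b
    return out_a, out_b

# Toy example: two 2x3 "feature maps" mixed with a mostly-identity alpha,
# so each output is dominated by its own branch but shares information.
a = np.ones((2, 3))      # stands in for the CNN-branch feature map
b = np.zeros((2, 3))     # stands in for the GRU-branch feature map
alpha = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
mixed_a, mixed_b = cross_stitch(a, b, alpha)  # mixed_a == 0.9, mixed_b == 0.1 everywhere
```

Setting alpha near the identity lets training start from nearly independent branches and learn how much cross-task sharing helps.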

Key words: sound event localization and detection, deep learning, Transformer model, Cross-stitch, feature fusion
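The generalized cross-correlation input feature named in the abstract is commonly computed with the phase transform (GCC-PHAT); the paper does not give code, so the following NumPy sketch, with names and parameters of my own choosing, only illustrates the idea: whiten the cross-spectrum of two microphone channels so that the remaining phase encodes the inter-channel time delay used for localization.

```python
import numpy as np

def gcc_phat(sig_a, sig_b, n_fft=None):
    """GCC-PHAT between two channels: normalize the cross-power spectrum
    to unit magnitude (phase transform) and return the correlation
    sequence reordered so that index len//2 is zero lag."""
    n = n_fft or max(len(sig_a), len(sig_b))
    spec_a = np.fft.rfft(sig_a, n=n)
    spec_b = np.fft.rfft(sig_b, n=n)
    cross = np.conj(spec_a) * spec_b      # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    # Note: without zero-padding this correlation is circular, which is
    # exact for the circular-shift toy example below.
    return np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))

# Toy example: sig_b is sig_a circularly delayed by 5 samples, so the
# GCC-PHAT peak appears 5 samples right of the center of the sequence.
rng = np.random.default_rng(0)
sig_a = rng.standard_normal(256)
sig_b = np.roll(sig_a, 5)
cc = gcc_phat(sig_a, sig_b)
delay = int(np.argmax(cc)) - len(cc) // 2   # -> 5
```

In a SELD front end such per-microphone-pair correlation sequences are stacked alongside the log-Mel spectrogram to form the multi-channel network input.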