
Computer Engineering & Science, 2026, Vol. 48, Issue 4: 743-751.

• Artificial Intelligence and Data Mining •

Sound event detection and localization based on a saliency detector and a decay mask self-attention module

WANG Chunli, CHEN Shanli, LIU Suqian, ZHAO Xiaochun

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  2. Department of Rehabilitation Medicine, Gansu Provincial Maternity and Child-Care Hospital (Gansu Provincial Central Hospital), Lanzhou 730050, China
  • Received: 2024-05-21  Revised: 2024-08-16  Online: 2026-04-25  Published: 2026-04-30
  • Funding:
    Inner Mongolia Key R&D and Achievement Transformation Program (2023YFSH0043, 2023YFDZ0043); Gansu Province Key Talent Project and Lanzhou Jiaotong University Youth Fund Project (LH2019005)

Abstract: An acoustic model is proposed that combines a saliency detector with multi-head self-attention equipped with a decay mask, so that the model attends more closely to spatial information when performing sound event detection and localization. First, the saliency detector focuses on the highly salient parts of local information, which directs the model toward categories with rich information content. Second, a decay mask is introduced into the multi-head self-attention module so that the model concentrates more on local information, and an adaptive constraint is added to diversify the attention heads. Experimental results show that the proposed model outperforms the baseline model and achieves better detection and localization than models that fuse Transformer and Multi-scale architectures. Finally, using video information as additional data further improves performance.

Key words: sound event detection and localization; saliency detector; multi-head self-attention; adaptive constrained decay mask
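
The decay mask idea from the abstract can be illustrated with a minimal single-head sketch. The abstract does not give the mask's functional form, so the exponential decay `gamma ** |i - j|` used here is an assumption, as are the names `decay_mask`, `decay_masked_attention`, and the parameter `gamma`; the saliency detector and the adaptive constraint on attention heads are not modelled. This is an illustration of how a distance-based mask biases attention toward nearby time frames, not the authors' implementation.

```python
import numpy as np


def decay_mask(seq_len, gamma):
    """Assumed mask form: entry (i, j) equals gamma ** |i - j|."""
    idx = np.arange(seq_len)
    dist = np.abs(idx[:, None] - idx[None, :])
    return gamma ** dist


def decay_masked_attention(q, k, v, gamma=0.9):
    """Single-head scaled dot-product attention reweighted by a decay mask.

    q, k, v: arrays of shape (seq_len, dim). Returns (output, weights).
    """
    dim = q.shape[-1]
    scores = q @ k.T / np.sqrt(dim)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    # Down-weight distant positions, then renormalize each row to sum to 1.
    weights = np.exp(scores) * decay_mask(q.shape[0], gamma)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Usage: with identical queries and keys, the weights depend only on distance.
frames = np.zeros((6, 4))
values = np.random.default_rng(0).normal(size=(6, 4))
out, w = decay_masked_attention(frames, frames, values, gamma=0.5)
```

With `gamma` close to 1 the mask is nearly flat and the layer behaves like ordinary self-attention; smaller values concentrate each frame's attention on its neighbours, which is the locality effect the abstract describes.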