Computer Engineering & Science

• Artificial Intelligence and Data Mining

An audio recognition method based on residual network and random forest

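ZHANG Xiaolong (1,2,3), PENG Yi (1,2,3)

  1. Hubei Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan 430065, Hubei, China
  2. Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
  3. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
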
  • Received: 2018-09-15; Revised: 2018-11-22; Online: 2019-04-25; Published: 2019-04-25
  • Funding: National Natural Science Foundation of China (61273225)

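Abstract:

Environmental sound classification (ESC) is an important branch of audio processing and is expected to play a significant role in future multimedia applications. Audio recognition extracts specific acoustic characteristics from an audio clip and assigns the clip to the scene it belongs to, which helps in perceiving and understanding the surrounding environment. At present, audio recognition relies mainly on signal processing techniques and machine learning methods. With the rapid development of artificial intelligence, traditional audio processing techniques and machine learning methods face considerable challenges, and recognition accuracy on ESC tasks still needs to improve. This paper combines a residual network with a random forest. The one-dimensional time-domain audio signal is first converted into a two-dimensional Mel spectrogram. A residual network is then pretrained to obtain a high-accuracy model that serves as a feature extractor, this model extracts deep features from the audio, and a random forest classifies those deep features. The method raises the recognition rate on the ESC task by nearly 10% and achieves good classification results.

Keywords: residual network, random forest, audio recognition, Mel spectrogram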
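The first step of the pipeline described in the abstract, converting a one-dimensional time-domain signal into a two-dimensional Mel spectrogram, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract gives no spectrogram parameters, so the sample rate, frame length, hop size, number of Mel bands, the HTK-style Mel formula, and all function names here are assumptions chosen for a small demo. The resulting time-by-Mel array is the kind of input a residual network would consume. The feature extraction and random forest stages are not shown.

```python
import math

def hz_to_mel(f):
    """HTK-style Mel scale (an assumed convention): mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale; shape n_mels x (n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sr / 2.0)
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for k in range(n_bins):
            f = k * sr / n_fft                  # centre frequency of DFT bin k
            up = (f - lo) / (mid - lo)          # rising slope of the triangle
            down = (hi - f) / (hi - mid)        # falling slope of the triangle
            row.append(max(0.0, min(up, down)))
        bank.append(row)
    return bank

def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (fine for a demo; use an FFT in practice)."""
    n = len(frame)
    out = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        out.append(re * re + im * im)
    return out

def log_mel_spectrogram(signal, sr, n_fft=64, hop=32, n_mels=8):
    """Return a list of frames, each holding n_mels log-Mel energies (time x Mel)."""
    # Periodic Hann window to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft) for t in range(n_fft)]
    bank = mel_filterbank(n_mels, n_fft, sr)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(n_fft)]
        spec = power_spectrum(frame)
        mel = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        frames.append([math.log(e + 1e-10) for e in mel])  # small offset avoids log(0)
    return frames

# Synthetic 1 kHz tone sampled at 8 kHz (placeholder data, not from the paper).
sr = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(256)]
spec = log_mel_spectrogram(signal, sr)
print(len(spec), len(spec[0]))  # 7 frames x 8 Mel bands
```

In practice a library routine with an FFT and a larger number of Mel bands would replace the naive DFT. The pure-Python version is used here only to keep the sketch self-contained.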
