• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (08): 1331-1339.

• High Performance Computing •

A low-power keyword spotting system with SRAM buffer and computing-in-memory

HUANG Zhi-rui1,2, JIA Xin-ru1,2, ZHU Hao-zhe1,2, CHEN Chi-xiao1,2

  1. State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai 200433, China
  2. Frontier Institute of Chip and System, Fudan University, Shanghai 200438, China

  • Received: 2023-11-10 Revised: 2023-12-27 Accepted: 2024-08-25 Online: 2024-08-25 Published: 2024-09-02
  • Supported by:
    National Key Research and Development Program of China (2022YFB4500101)

Abstract: Deploying keyword spotting (KWS) algorithms on edge computing hardware incurs high power consumption, which shortens the battery life of battery-powered devices. To address this problem, this paper proposes a low-power KWS system based on computing-in-memory (CIM) technology and software-hardware co-optimization. At the algorithm level, a ternary quantized MFCC-CNN joint algorithm is proposed on the basis of the standard MFCC algorithm topology, and all general matrix multiplications (GEMMs) in MFCC are mapped to the neural network accelerator. At the circuit level, an SRAM-based CIM core is proposed to overcome the power wall and memory wall of traditional von Neumann architecture accelerators. In addition, by reusing the SRAM storage function of the CIM core, a look-up-table-based SRAM buffer circuit is proposed to replace the register delay chain. Both the SRAM-based CIM core and the SRAM buffer are implemented with custom cells. At the system level, a low-power KWS system is built on these two custom circuits. The system is designed with a combined ASIC and custom circuit design flow and synthesized with a 28 nm CMOS process library. At 250 kHz, the system has a latency of 64 ms on a 10-class classification task and a total power consumption of 645.28 μW. The dynamic power of the MFCC pipeline accounts for 5.9% of the total dynamic power, and the total power of the MFCC pipeline accounts for only 1.3% of the system power.

Key words: keyword spotting, ternary quantized neural network, computing-in-memory, serial fast Fourier transform (FFT), software-hardware co-design
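
As a rough illustration of what mapping an MFCC matrix multiplication onto ternary hardware could look like, the sketch below ternarizes a DCT-II basis (the final MFCC step) and applies it using only additions and subtractions. It is not the authors' algorithm: the threshold rule (0.7 times the mean absolute weight), the matrix sizes, the input values, and all function names are assumptions made for this example.

```python
import math

def dct_matrix(n_out, n_in):
    """DCT-II basis: entry (k, n) is cos(pi * k * (2n + 1) / (2 * n_in))."""
    return [[math.cos(math.pi * k * (2 * n + 1) / (2 * n_in))
             for n in range(n_in)] for k in range(n_out)]

def ternarize(matrix, ratio=0.7):
    """Map each weight to -1, 0 or +1.

    Threshold = ratio * mean(|w|) is an assumed heuristic,
    not the quantization rule used in the paper.
    """
    flat = [abs(w) for row in matrix for w in row]
    threshold = ratio * sum(flat) / len(flat)
    return [[0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in row]
            for row in matrix]

def ternary_matvec(t_matrix, x):
    """Matrix-vector product with ternary weights: no multiplications needed."""
    out = []
    for row in t_matrix:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v
            elif w == -1:
                acc -= v
        out.append(acc)
    return out

# Illustrative sizes and inputs, not taken from the paper.
N_MEL, N_CEPS = 8, 4
T = ternarize(dct_matrix(N_CEPS, N_MEL))
log_mel = [0.5, 1.0, 1.5, 2.0, 2.0, 1.5, 1.0, 0.5]  # made-up log-mel energies
coeffs = ternary_matvec(T, log_mel)
```

With weights restricted to {-1, 0, +1}, each output reduces to a signed sum of inputs, which is the kind of operation a multiply-free accumulation array can evaluate; how closely this matches the paper's CIM mapping is not stated in the abstract.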