
计算机工程与科学 (Computer Engineering &amp; Science) ›› 2023, Vol. 45 ›› Issue (05): 802-809.

• High Performance Computing •


A fused-layer attention model accelerator based on systolic array

LIU Xiao-hang1, JIANG Jing-fei2, XU Jin-wei2

  (1. Graduate College, National University of Defense Technology, Changsha 410073, China;
   2. Science and Technology on Parallel and Distributed Processing Laboratory,
      National University of Defense Technology, Changsha 410073, China)
  • Received: 2022-10-24  Revised: 2022-12-15  Accepted: 2023-05-25  Online: 2023-05-25  Published: 2023-05-16
  • Supported by:
    Key Project of the Stable Support Program for National Defense Science and Technology Key Laboratories, State Administration of Science, Technology and Industry for National Defense (WDZC20215250103)


Abstract: The attention mechanism has recently shown superior performance in deep neural networks, but its computation involves complex data flow and incurs large computation and memory overheads, so customized accelerators are needed to optimize inference. This paper proposes an accelerator architecture for attention mechanism computation. A flexible, hardware-controlled partitioning method divides the huge matrices in the attention model into hardware-friendly computing blocks, so that the block computations match the accelerator's systolic array. A layer-fusion computing method based on a two-step decomposition of the softmax function is proposed, which effectively reduces the memory accesses of attention model computation. A fused-layer attention model accelerator with fine-grained computation scheduling is designed and implemented in a hardware description language (HDL). Performance was evaluated on a XILINX FPGA device using the HLS tool. Under the same settings, the accelerator achieves a 4.9-fold latency speedup over a CPU and 1.24-fold higher energy efficiency than a GPU.
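The abstract names two software-visible ideas: splitting the attention model's large matrices into blocks sized for the systolic array, and a two-step decomposition of the softmax that lets the score, softmax, and weighting layers be fused so intermediate results need not be written back to memory. The sketch below is a minimal NumPy model of that style of computation, not the paper's hardware design: the `tile` parameter, the function name `fused_tiled_attention`, and the running-max/running-sum formulation of the two-step softmax are illustrative assumptions.

```python
# Minimal NumPy sketch of tiled, fused attention (Q·K^T, softmax, ·V) with a
# two-step softmax: step 1 accumulates a running max and running exponential
# sum per row block, step 2 normalizes once at the end. Tile size, loop order,
# and the exact decomposition are illustrative, not the paper's design.
import numpy as np

def fused_tiled_attention(Q, K, V, tile=4):
    """Row-tiled attention; `tile` stands in for the systolic-array dimension
    that block matrices are shaped to match. Q: (n, d), K, V: (m, d)."""
    n, d = Q.shape
    m, _ = K.shape
    O = np.zeros((n, d))
    for i in range(0, n, tile):                    # one block of query rows
        q = Q[i:i + tile]
        run_max = np.full((q.shape[0], 1), -np.inf)
        run_sum = np.zeros((q.shape[0], 1))
        acc = np.zeros((q.shape[0], d))
        for j in range(0, m, tile):                # stream blocks of K and V
            k = K[j:j + tile]
            v = V[j:j + tile]
            s = q @ k.T                            # partial score block, kept on chip
            blk_max = s.max(axis=1, keepdims=True)
            new_max = np.maximum(run_max, blk_max)
            # Step 1: accumulate exponentials relative to the running maximum.
            p = np.exp(s - new_max)
            scale = np.exp(run_max - new_max)      # rescale earlier partial sums
            run_sum = run_sum * scale + p.sum(axis=1, keepdims=True)
            acc = acc * scale + p @ v
            run_max = new_max
        # Step 2: one final normalization per row block.
        O[i:i + tile] = acc / run_sum
    return O

# Reference check against an unfused softmax(Q·K^T)·V implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(fused_tiled_attention(Q, K, V), ref)
```

Because each score block is consumed as soon as it is produced, the full score matrix never has to leave on-chip buffers, which is the kind of memory-traffic saving the layer-fusion scheme targets.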

Key words: systolic array, attention mechanism, fused-layer, accelerator architecture, matrix blocking, softmax