• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (05): 802-809.

• High Performance Computing •

A fused-layer attention model accelerator based on systolic array

LIU Xiao-hang1, JIANG Jing-fei2, XU Jin-wei2

  (1. Graduate College, National University of Defense Technology, Changsha 410073;
  2. Science and Technology on Parallel and Distributed Processing Laboratory,
  National University of Defense Technology, Changsha 410073, China)
  • Received: 2022-10-24; Revised: 2022-12-15; Accepted: 2023-05-25; Online: 2023-05-25; Published: 2023-05-16

Abstract: The attention mechanism has recently shown superior performance in deep neural networks, but its computation generates complex data flow and incurs high computation and memory overheads, so customized accelerators are needed to optimize inference. This paper proposes an accelerator architecture for attention-mechanism computation. A flexible, hardware-controlled partitioning method divides the large matrices in the attention model into hardware-friendly computing blocks, so that the block computation matches the systolic array in the accelerator. A layer-fusion computing structure based on a two-step decomposition of the softmax function is proposed, which effectively reduces the memory accesses of attention computation. A fused-layer attention model accelerator based on fine-grained computational scheduling is designed and implemented in HDL. Its performance was evaluated on a Xilinx FPGA device with the HLS tool. Compared with CPU and GPU implementations under the same settings, the accelerator achieves a 4.91x latency improvement and a 1.24x efficiency improvement.
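The paper does not spell out its two-step softmax decomposition here, but a common decomposition used for layer fusion splits softmax into a streaming statistics pass (a running row maximum and a rescaled exponent sum) followed by a normalization pass, which avoids materializing the full score matrix between layers. The following is a minimal NumPy sketch of that general technique, assuming row-wise softmax over an attention score matrix; the function name and block width are illustrative, not taken from the paper.

    import numpy as np

    def two_step_softmax(scores: np.ndarray, block: int = 64) -> np.ndarray:
        # Step 1: stream over each row in blocks, keeping a running row
        # max m and a running sum s of exp(x - m); the old sum is rescaled
        # whenever the max grows, so only O(rows) statistics persist
        # between blocks (block here stands in for a systolic-array tile).
        m = np.full(scores.shape[0], -np.inf)
        s = np.zeros(scores.shape[0])
        for j in range(0, scores.shape[1], block):
            blk = scores[:, j:j + block]
            m_new = np.maximum(m, blk.max(axis=1))
            s = s * np.exp(m - m_new) + np.exp(blk - m_new[:, None]).sum(axis=1)
            m = m_new
        # Step 2: normalize with the finished statistics; this pass can be
        # fused into the computation that consumes the probabilities.
        return np.exp(scores - m[:, None]) / s[:, None]

Because the statistics pass only needs each score block once, it can be fused with the preceding matrix multiplication that produces the scores; this is the general mechanism by which such decompositions reduce memory accesses, as the abstract describes.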

Key words: systolic array, attention mechanism, fused-layer, accelerator architecture, matrix blocking, softmax