
Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (07): 1149-1158.

• High Performance Computing •

A deep learning programming framework for FT-Matrix DSP+MatrixZone heterogeneous systems

KANG Yu-han1, SHI Yang2, CHEN Zhao-yun2, WEN Mei2

  1. School of Information Science and Engineering, Hunan Normal University, Changsha 410081, China;
  2. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
  • Received: 2023-01-01  Revised: 2023-03-27  Accepted: 2023-07-25  Online: 2023-07-25  Published: 2023-07-11
  • Supported by: National Natural Science Foundation of China (62002366)

Abstract: To meet deep learning models' demands for rapid iteration and high computing power, mainstream hardware vendors increasingly favor heterogeneous systems that pair general-purpose processors with AI-specific accelerator cores. However, AI-specific accelerator cores support only a subset of core operators and lack general programmability, so efficiently deploying deep learning tasks on such heterogeneous architectures deserves in-depth study. Based on the domestically developed FT-Matrix DSP+MatrixZone heterogeneous system platform, this paper designs and implements a deep learning programming framework called KaiSa. KaiSa analyzes the input parameters of a deep learning model, identifies each operator's type, and assigns the operator to the corresponding computing core. For complex operators, KaiSa automatically searches for the optimal block size based on a performance model, improving the performance of dual-core parallel computing. At the same time, KaiSa hides all low-level hardware details, providing users with a friendly programming environment for efficient program development. Experimental results show that KaiSa achieves performance improvements of up to 39.0%.
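
The abstract describes two mechanisms: routing each operator to either the general-purpose DSP core or the AI accelerator core, and choosing a block (tile) size for complex operators by minimizing a performance model. The Python sketch below illustrates that flow only; every name in it (Operator, dispatch, predicted_time, search_tile_size, the supported-operator set, and the toy cost model) is an assumption made for illustration, not KaiSa's actual interface or model.

# Illustrative sketch, not KaiSa's real API:
# (1) dispatch each operator to the accelerator core if it is supported there,
#     otherwise to the general-purpose DSP core, and
# (2) for "complex" operators, pick a block (tile) size by minimizing a
#     simple analytical performance model.

from dataclasses import dataclass

# Operator types the hypothetical accelerator core supports natively.
SUPPORTED_ON_ACCELERATOR = {"conv2d", "matmul"}

@dataclass
class Operator:
    name: str        # operator type, e.g. "conv2d", "softmax"
    workload: int    # total number of multiply-accumulate operations
    is_complex: bool # complex operators get an automatic tile-size search

def dispatch(op: Operator) -> str:
    """Route an operator to the accelerator core or the general DSP core."""
    return "accelerator" if op.name in SUPPORTED_ON_ACCELERATOR else "dsp"

def predicted_time(op: Operator, tile: int) -> float:
    """Toy cost model: larger tiles improve data reuse, but each tile adds
    launch/transfer overhead, and tiles beyond the on-chip capacity spill."""
    num_tiles = -(-op.workload // tile)                 # ceiling division
    compute = op.workload / (tile ** 0.5)               # reuse improves with tile size
    overhead = 5.0 * num_tiles                          # fixed cost per tile
    spill = 0.0 if tile <= 256 else 0.2 * op.workload   # toy on-chip capacity limit
    return compute + overhead + spill

def search_tile_size(op: Operator, candidates=(64, 128, 256, 512, 1024)) -> int:
    """Pick the candidate tile size the cost model predicts is fastest."""
    return min(candidates, key=lambda t: predicted_time(op, t))

if __name__ == "__main__":
    model = [
        Operator("conv2d", workload=1 << 20, is_complex=True),
        Operator("softmax", workload=1 << 14, is_complex=False),
    ]
    for op in model:
        core = dispatch(op)
        tile = search_tile_size(op) if op.is_complex else None
        print(f"{op.name:<8} -> {core:<11} tile={tile}")

In this toy setup, conv2d is routed to the accelerator core and the search settles on a mid-sized tile (the overhead and spill terms penalize the extremes), while softmax falls back to the DSP core with no tiling; the paper's actual dispatch rules and performance model are not given in the abstract.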

Key words: deep learning; FT-Matrix; MatrixZone; heterogeneous system; performance optimization