• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2023, Vol. 45 ›› Issue (12): 2121-2134.

• 高性能计算 • 上一篇    下一篇

多面体模型下的循环置换与自动调优

彭畅1,2,刘青枝1,2,陈长波1,2   

  1. (1.中国科学院重庆绿色智能技术研究院,重庆 400714;2.中国科学院大学重庆学院,重庆 400714)
  • 收稿日期:2023-03-06 修回日期:2023-08-22 接受日期:2023-12-25 出版日期:2023-12-25 发布日期:2023-12-14
  • 基金资助:
    国家重点研发计划(2020YFA0712300);重庆英才计划青年拔尖项目 (2021000263);中国科学院“西部之光”;重庆市院士牵头科技创新引导专项(cstc2019yszx-jcyjX0003,cstc2020yszx-jcyjX0005,cstc2021yszx-jcyjX0005)

Loop permutation and auto-tuning under polyhedral model

PENG Chang1,2,LIU Qing-zhi1,2,CHEN Chang-bo1,2   

  1. (1Chongqing Institute of Green and Intelligent Technology,Chinese Academy of Sciences,Chongqing 400714;
    2..Chongqing  School,University of Chinese Academy of Sciences,Chongqing 400714,China)
  • Received:2023-03-06 Revised:2023-08-22 Accepted:2023-12-25 Online:2023-12-25 Published:2023-12-14

摘要: 针对常用多面体编译器Pluto默认循环调度和分块大小性能欠佳的问题,提出了一种为其调度计算多种合法置换,根据置换和分块大小构成的配置空间为循环程序自动调优的方法。通过对定义循环融合的标量维度的处理,实现了非完美嵌套循环块间和块内的同时置换。构建了4种机器学习驱动的自动调优策略,为循环程序在指定问题规模下寻找优化的置换序和分块大小组合。默认分块大小下,扩展后的Pluto编译器并行环境下生成的最佳置换相较于Pluto的默认调度取得了最高4.02和几何平均2.12的加速比。通过进一步搜索更优的置换序和分块大小组合,最好的自动调优策略在并行环境下相较于Pluto的默认优化取得了最高5.48和几何平均2.86的加速比。此外,指定问题规模下,自动调优得到的最佳配置和学习模型应用于相似问题规模时,相较于Pluto的默认优化也取得了不同程度的性能提升。

关键词: 循环置换, 循环分块, 多面体模型, 自动调优, 机器学习

Abstract: Aiming at improving the performance of the default loop scheduling and tile size of Pluto, a commonly used polyhedral compiler, this paper proposes a method to compute a variety of legal permutations for its default scheduling and auto-tune its performance according to the configuration space composed of permutations and tile sizes. Through the processing of scalar dimension that defines loop fusion, both intra and inter permutations for imperfect loop nest are realized. Four machine learning driven auto-tuning strategies are proposed to find the optimized combination of permutation order and tile size for a loop with a given problem size. Under the default tile size, the optimal permutation gene- rated by the extended Pluto compiler in a parallel environment achieves a maximum speedup of 4.02  and a geometric mean of 2.12 compared with the default scheduling of Pluto. By further searching for a better combination of permutation order and tile size, the best auto-tuning strategy achieves a maximum speedup of 5.48  and a geometric mean of 2.86 compared with Pluto's default optimization in a parallel environment. In addition, the best configuration and the learned model obtained by auto-tuning for a particular problem size, when being applied to similar problem sizes, also outperform the default optimization of Pluto in various degrees.



Key words: loop permutation, loop tiling, polyhedral model, auto-tuning, machine learning