• Journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2020, Vol. 42 ›› Issue (10, High Performance Computing Special Issue): 1842-1851.

• High Performance Computer System Applications •

A study of the parallel optimization of the Tend_lin application on "Sunway TaihuLight"

JIANG Shang-zhi1, TANG Sheng-lin2, GAO Xi-ran2, HUA Rong1, CHEN Li2, LIU Ying2

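  (1. College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590;

    2. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)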
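  • Received: 2020-06-11 Revised: 2020-07-29 Accepted: 2020-10-25 Online: 2020-10-25 Published: 2020-10-23
  • Funding:
    National Key Research and Development Program of China (2016YFB0200803); National Natural Science Foundation of China (61521092)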

Parallel optimization of Tend_lin application on the Sunway TaihuLight supercomputer

姜尚志1, 唐生林2, 高希然2, 花嵘1, 陈莉2, 刘颖2

  (1. College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590;

    2. State Key Laboratory of Computer Architecture, Institute of Computing Technology,
    Chinese Academy of Sciences, Beijing 100190, China)

  • Received: 2020-06-11 Revised: 2020-07-29 Accepted: 2020-10-25 Online: 2020-10-25 Published: 2020-10-23

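Abstract (Chinese version): Atmospheric general circulation models are one of the main tools for studying global climate change and its causes, and running a complex atmospheric general circulation model efficiently in parallel on a large-scale heterogeneous many-core parallel system is a challenging problem. Tend_lin is the hot-spot procedure of the dynamical core of IAP AGCM-4, the fourth-generation atmospheric general circulation model developed by the Institute of Atmospheric Physics, Chinese Academy of Sciences, and it has a low computation-to-communication ratio. Targeting Sunway TaihuLight, a domestically built large-scale heterogeneous many-core supercomputer, Tend_lin is optimized with two different parallel programming interfaces, OpenACC and AceMesh. The paper focuses on how to accelerate Tend_lin with AceMesh, a data-driven task-parallel programming interface: it presents the task-parallelization method for the computation loops and the communication code, discusses how to relax the sharing of communication resources, and compares task mapping under a single-level task graph and a nested task graph, among other optimization issues. Test results show that, compared with OpenACC, AceMesh achieves an average performance improvement of about 2x across parallel configurations of 16 to 1024 processes. Finally, the sources of the performance gain are analyzed in detail.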

Keywords (Chinese version): atmospheric general circulation model, high resolution, data-driven task-parallel language, OpenACC, MPI

Abstract:

Numerical simulation of the global atmospheric circulation is one of the main tools for understanding the formation and dynamic behavior of the global climate, and porting and optimizing such a complex application on large-scale heterogeneous platforms is a considerable challenge. Tend_lin is the hot spot of the dynamic core of IAP AGCM-4 (the fourth-generation IAP atmospheric general circulation model), and it has a low compute-to-communication ratio. The paper ports Tend_lin to Sunway TaihuLight (a large-scale heterogeneous computing platform) using two different parallel application programming interfaces, OpenACC and AceMesh. It describes how to parallelize the program with AceMesh, a data-driven task-parallel programming interface, covering the task parallelization of computation loops and MPI communication, how to relax the sharing of communication resources, and the differences in task mapping between a single-level task graph and a nested task graph. Experimental results show that AceMesh attains a speedup of more than 2x over the OpenACC version when using 16 to 1024 processes. The paper also analyzes and explains the sources of the performance improvement.

Key words: global atmospheric general circulation model, high resolution, data-driven task-parallel language, OpenACC, MPI