• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science (计算机工程与科学)

• High Performance Computing

快速多极子方法在申威众核处理器上的实现和优化

王武1,王舒扬1,2,姜金荣1,孟虹松3   

  1. (1.中国科学院计算机网络信息中心,北京 100190;2.中国科学院大学,北京 100049;
    3.国家超级计算无锡中心,江苏 无锡 214072)
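  • Received: 2018-10-25 Revised: 2018-12-10 Online: 2019-07-25 Published: 2019-07-25
  • Supported by:

    National Key Research and Development Program of China (2017YFB0203303); Chinese Academy of Sciences 13th Five-Year Informatization Application Project (XXH13506-405)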

Implementation and optimization of fast multipole method on Sunway manycore processors

WANG Wu1, WANG Shuyang1,2, JIANG Jinrong1, MENG Hongsong3

  (1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190;
    2. University of Chinese Academy of Sciences, Beijing 100049;
    3. National Supercomputing Center in Wuxi, Wuxi 214072, China)
  • Received:2018-10-25 Revised:2018-12-10 Online:2019-07-25 Published:2019-07-25


Abstract:

The fast multipole method (FMM) is a fast and efficient numerical algorithm for solving the N-body problem and is widely used in simulations in cosmology, molecular dynamics, and related fields. Sunway SW26010 is a heterogeneous manycore processor developed in China, with 260 cores organized into 4 core groups. We design and implement the FMM on the SW26010 manycore architecture and systematically optimize the performance of its kernel functions, in particular the most time-consuming one, the particle pair interaction. The optimizations include asynchronous direct memory access (DMA), SIMD vectorization, loop unrolling, and tuning of inline assembly instructions. For the particle pair interaction kernel, the optimized code runs about 400 times as fast as the original code running on the host core, and the floating-point performance of each core group reaches 250 GFLOPS, which is 32.5% of the theoretical peak performance.

Key words: fast multipole method (FMM), heterogeneous manycore processor, N-body problem, performance optimization