Implementation and Optimization of Stencil Applications on GPUs

Fang Xudong (方旭东), Tang Yuhua (唐玉华), Wang Guibin (王桂彬), Tang Tao (唐滔)

doi:10.3969/j.issn.1007130X.2011.

Computer Engineering & Science (计算机工程与科学), 2011, Vol. 33, Issue 3: 41-45


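(School of Computer Science, National University of Defense Technology, Changsha 410073, Hunan, China)

Fang Xudong (b. 1985), male, from Zhuji, Zhejiang; master's student; research interest: computer system software. Wang Guibin (b. 1981), male; PhD candidate; research interest: computer architecture. Tang Tao (b. 1984), male; PhD candidate; research interest: computer architecture.

Received: 2009-07-26

Revised: 2009-10-21

Published online: 2011-03-25

Funding

National Natural Science Foundation of China (60621003)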



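Abstract

With the rapid development of GPUs, using them to accelerate scientific computing applications has become an inevitable trend. We extract Rprj3 and Interp, two typical stencil-rich subroutines of Mgrid from SPEC2000, and port them to an AMD GPU using the Brook+ language. Using the thread granularity tuning mechanism that Brook+ provides, we implement program versions at different thread granularities, analyze why their speedups differ, and summarize what thread granularity tuning implies for porting stencil programs. With an AMD Radeon HD4870 GPU as the experimental platform and an Intel Xeon E5405 CPU as the baseline, at the largest problem size Rprj3 achieves a speedup of 5.37× over its CPU version and Interp achieves a speedup of 12.8×.

Keywords: GPU; optimization; stencil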
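
As a rough illustration of what "thread granularity" means for a stencil kernel, the sketch below models GPU threads as loop iterations in plain Python. It is not the authors' Brook+ code: the function names are hypothetical, and a 1D 3-point average stands in for the 3D stencils of Rprj3 and Interp. At granularity 1 each simulated thread produces one output point; at larger granularities each thread produces several consecutive points, so fewer threads are launched while the result stays the same.

```python
def thread_count(n, granularity):
    """Number of simulated threads needed to cover the n - 2 interior points."""
    interior = max(n - 2, 0)
    return (interior + granularity - 1) // granularity  # ceiling division


def stencil_reference(a):
    """3-point averaging stencil on interior points; boundary values are copied."""
    out = list(a)
    for i in range(1, len(a) - 1):
        out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


def stencil_coarsened(a, granularity):
    """Same stencil, organised as 'threads' that each compute
    `granularity` consecutive output points (illustrative only)."""
    n = len(a)
    out = list(a)
    for tid in range(thread_count(n, granularity)):  # one iteration = one thread
        start = 1 + tid * granularity
        stop = min(start + granularity, n - 1)       # last thread may do less work
        for i in range(start, stop):
            out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out


data = [float(i * i) for i in range(10)]
for g in (1, 2, 4):
    # Every granularity yields the same values; only the work split changes.
    assert stencil_coarsened(data, g) == stencil_reference(data)
```

On real hardware the choice of granularity trades the number of threads against the work and data reuse per thread, which is the kind of effect the paper reports analyzing; this sketch only shows the work partitioning, not the performance behavior.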

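Cite this article

Fang Xudong, Tang Yuhua, Wang Guibin, Tang Tao. Implementation and Optimization of Stencil Applications on GPUs[J]. Computer Engineering & Science, 2011, 33(3): 41-45. DOI: 10.3969/j.issn.1007130X.2011.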


