• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal
Article

Implementation and Optimization of Stencil Applications on GPUs

  • (School of Computer Science, National University of Defense Technology, Changsha 410073, China)

Received date: 2009-07-26

Revised date: 2009-10-21

Published online: 2011-03-25

Abstract

With the rapid development of GPUs, using them to accelerate scientific computing applications is becoming an inevitable trend. In this paper, we port two representative subroutines, Rprj3 and Interp, from Mgrid, a SPEC2000 benchmark rich in stencil operations, to an AMD GPU using Brook+. Using the thread granularity tuning mechanism provided by Brook+, we implement several versions of the ported programs and analyze their performance. We also summarize how thread granularity tuning can be used to optimize the porting of stencil programs. Our experimental results show that, at the largest problem size, Rprj3 achieves a speedup of 5.37 over its CPU version and Interp achieves a speedup of 12.8 over its CPU version.
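To make the idea of thread granularity concrete, the following is a minimal Python sketch, not the Brook+ kernels used in the paper. It assumes a simplified one-dimensional three-point stencil in place of the three-dimensional stencils in Mgrid; the weights, the function names `stencil_point` and `run_stencil`, and the `granularity` parameter are all hypothetical. Each simulated thread produces `granularity` consecutive outputs, so a coarser granularity means fewer threads doing more work each, while the numerical result stays the same.

```python
# Illustrative sketch only: this is not the Brook+ code from the paper.
# A 1D three-point stencil stands in for the 3D stencils found in Mgrid.

def stencil_point(u, i):
    """Weighted three-point average at interior index i (weights are assumed)."""
    return 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]


def run_stencil(u, granularity):
    """Apply the stencil, with each simulated thread producing `granularity` outputs."""
    n_out = len(u) - 2  # only interior points have both neighbours
    out = [0.0] * n_out
    n_threads = (n_out + granularity - 1) // granularity  # ceiling division
    for t in range(n_threads):  # on a GPU these iterations would run in parallel
        start = t * granularity
        for k in range(start, min(start + granularity, n_out)):
            out[k] = stencil_point(u, k + 1)
    return out, n_threads


u = [float(i * i) for i in range(10)]
fine, fine_threads = run_stencil(u, granularity=1)      # one output per thread
coarse, coarse_threads = run_stencil(u, granularity=4)  # four outputs per thread
```

In this sketch both runs return identical values and only the thread count changes. A thread that computes adjacent outputs can in principle reuse neighbouring inputs, but which granularity performs best depends on the hardware and the kernel, which is why it is treated as a tuning parameter.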

Cite this article

FANG Xudong,TANG Yuhua,WANG Guibin,TANG Tao . Implementation and Optimization of  Stencil Applications on GPUs[J]. Computer Engineering & Science, 2011 , 33(3) : 41 -45 . DOI: 10.3969/j.issn.1007130X.2011.

