• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2011, Vol. 33 ›› Issue (11): 54-59.

• Articles •

Supported by the National Natural Science Foundation of China (60921062, 61003087)

JIA Jia   

  1. (National Laboratory for Parallel and Distributed Processing, Changsha 410073, China)
  • Received:2010-05-20 Revised:2010-10-26 Online:2011-11-25 Published:2011-11-25

Abstract:

The applicationlevel checkpointing technique is one of the most commonly used and well matured faulttolerance techniques in homogenous systems. However, It is on its infant phase in heterogeneous systems and there are not accurate and reasonable solutions or approaches with respect to architectures and fault models of heterogeneous systems. Motivated by this observation, based on the architecture and programming model of the CUDA heterogeneous system, this paper analyzes the execution mode of CUDA programs running on CPUs and GPUs and proposes an asynchronous execution mechanism for the applicationlevel checkpointing technique in heterogeneous systems. With this mechanism, we explore a solution of optimal placement of checkpoints in heterogeneous systems. Finally, three experimental cases in the CUDA platform are used to evaluate our technique’s performance, feasibility and viability. The results demonstrate the effectiveness of our asynchronous execution mechanism for the applicationlevel checkpointing technique in heterogeneous systems. Compared with the synchronous execution mechanism, our mechanism is more flexible and has broader optimization space to explore. Moreover, Our solution of optimal placement of checkpoints can efficiently reduce checkpointing overhead and hence obtain higher performance.

Key words: application-level checkpointing; heterogeneous system; asynchronous execution mechanism; optimal placement of checkpoints
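
As a rough illustration of what "optimal placement of checkpoints" trades off, the sketch below uses Young's classic first-order approximation for the checkpoint interval. This is not the placement solution proposed in the paper, which is not described in the abstract. The function names, the example checkpoint cost, and the mean time between failures (MTBF) are all hypothetical values chosen for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal interval
    between checkpoints: sqrt(2 * C * MTBF).

    checkpoint_cost: time to write one checkpoint (seconds), assumed constant.
    mtbf: mean time between failures (seconds), assumed much larger than C.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order expected overhead per unit of useful work:
    the cost of writing checkpoints (C / T) plus the expected
    recomputation after a failure (T / (2 * MTBF))."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 2.0, 3600.0  # hypothetical: 2 s checkpoint, 1 h MTBF
    best = young_interval(cost, mtbf)
    for t in (best / 2, best, best * 2):
        print(f"interval={t:7.1f}s overhead={overhead_fraction(t, cost, mtbf):.4f}")
```

Under this simple model, checkpointing too often wastes time writing state, while checkpointing too rarely wastes time recomputing lost work. An asynchronous mechanism that overlaps checkpoint writing with GPU computation would effectively lower the visible checkpoint cost, which in this model shortens the optimal interval and reduces total overhead.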