J4 ›› 2011, Vol. 33 ›› Issue (11): 54-59.
• Paper •
JIA Jia
Abstract:
Application-level checkpointing is one of the most widely used and mature fault-tolerance techniques in homogeneous systems. In heterogeneous systems, however, it is still in its infancy, and no accurate, well-founded approach yet accounts for the architectures and fault models of such systems. Motivated by this observation, and based on the architecture and programming model of the CUDA heterogeneous system, this paper analyzes the execution mode of CUDA programs running on CPUs and GPUs and proposes an asynchronous execution mechanism for application-level checkpointing in heterogeneous systems. Using this mechanism, we explore a solution for the optimal placement of checkpoints in heterogeneous systems. Finally, three experimental cases on the CUDA platform are used to evaluate the performance, feasibility, and viability of our technique. The results demonstrate the effectiveness of the asynchronous execution mechanism for application-level checkpointing in heterogeneous systems. Compared with the synchronous execution mechanism, our mechanism is more flexible and offers a broader optimization space. Moreover, our solution for the optimal placement of checkpoints efficiently reduces checkpointing overhead and thereby achieves higher performance.
Key words: application-level checkpointing; heterogeneous system; asynchronous execution mechanism; optimal placement of checkpoints
JIA Jia. Supported by the National Natural Science Foundation of China (60921062, 61003087) [J]. J4, 2011, 33(11): 54-59.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2011/V33/I11/54