• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2010, Vol. 32 ›› Issue (10): 38-43.doi: 10.3969/j.issn.1007130X.2010.

• Articles •

A Survey of the FaultTolerance Techniques for LargeScale Parallel Computing Systems

FU Hongyi,YANG Xuejun   

  1. (National Laboratory for Parallel and Distributed Processing, Changsha 410073, China)
  • Received: 2009-05-11 Revised: 2009-09-27 Online: 2010-09-29 Published: 2010-08-02

Abstract:

Fault tolerance is critical to computer systems. In recent years, as architectures have grown more complex and semiconductor technology has advanced, chip density has increased substantially. As a consequence, reliability has become a concern not only for large-scale parallel systems but also for distributed environments and even desktop applications. This paper reviews a number of representative fault-tolerance techniques for hardware faults proposed in recent years, with emphasis on those designed for large-scale parallel systems, draws some preliminary conclusions, and proposes several potential research topics in this domain.

Key words: large-scale parallel computing; fault-tolerance technique; reliability