• Official journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal
Article

A Survey of the FaultTolerance Techniques for LargeScale Parallel Computing Systems

  • (National Laboratory for Parallel and Distributed Processing, Changsha 410073, China)

Received date: 2009-05-11

  Revised date: 2009-09-27

  Online published: 2010-08-02

Abstract

Fault tolerance is critical to computer systems. In recent years, as architectures have grown more complex and semiconductor technology has advanced, chip density has risen sharply. As a consequence, reliability has become a concern not only for large-scale parallel systems but also for distributed environments and even desktop applications. This paper reviews a number of representative fault-tolerance techniques for hardware faults proposed in recent years, especially those designed for large-scale parallel systems, draws some preliminary conclusions, and proposes several potential research topics in this domain.

Cite this article

FU Hongyi,YANG Xuejun . A Survey of the FaultTolerance Techniques for LargeScale Parallel Computing Systems[J]. Computer Engineering & Science, 2010 , 32(10) : 38 -43 . DOI: 10.3969/j.issn.1007130X.2010.
