Computer Engineering & Science
A Survey of the Fault-Tolerance Techniques for Large-Scale Parallel Computing Systems
Received date: 2009-05-11
Revised date: 2009-09-27
Online published: 2010-08-02
Fault tolerance is critical to computer systems. Recently, with the ever-increasing complexity of architectures and advances in semiconductor technology, chip density has grown much higher. As a consequence, reliability has become a concern not only for large-scale parallel systems but also for distributed environments and even desktop applications. This paper reviews a number of typical fault-tolerance techniques for hardware faults proposed in recent years, especially those designed for large-scale parallel systems, draws some preliminary conclusions, and puts forward several potential research topics in this domain.
FU Hongyi, YANG Xuejun. A Survey of the Fault-Tolerance Techniques for Large-Scale Parallel Computing Systems[J]. Computer Engineering & Science, 2010, 32(10): 38-43. DOI: 10.3969/j.issn.1007-130X.2010.