A Survey of the Fault-Tolerance Techniques for Large-Scale Parallel Computing Systems
Received: 2009-05-11
Revised: 2009-09-27
Published online: 2010-08-02
Funding
Supported by the National Natural Science Foundation of China (60621003, 60633050)
Fu Hongyi, Yang Xuejun. A Survey of the Fault-Tolerance Techniques for Large-Scale Parallel Computing Systems[J]. Computer Engineering & Science, 2010, 32(10): 38-43. DOI: 10.3969/j.issn.1007130X.2010.
Fault tolerance is critical to computer systems. Recently, with the ever-increasing complexity of architectures and advances in semiconductor technology, chip density has grown much higher. As a consequence, reliability has become a concern not only for large-scale parallel systems but also for distributed environments and even desktop applications. This paper reviews a number of typical fault-tolerance techniques for hardware faults proposed in recent years, especially those designed for large-scale parallel systems, draws some preliminary conclusions, and puts forward several potential research topics in this domain.