[1]Meuer H,Simon H,Strohmaier E,et al. TOP500 supercomputer sites[EB/OL].[20150901]. http://www.top500.org.
[2]Lu CD. Scalable diskless checkpointing for large parallel systems[D]. UrbanaChampaign:University of Illinois,2005.
[3]Bosilca G,Bouteiller A,Cappello F,et al. MPICHV:Toward a scalable fault tolerant MPI for volatile nodes[C]∥Proc of SC’02,2002:118.
[4]Engelmann C,Geist A. A diskless checkpointing algorithm for superscale architectures applied to the fast Fourier transform[C]∥Proc of the 1st International Workshop on Challenges of Large Applications in Distributed Environments,2003:4752.
[5]Los Alamos National Laboratory. Operational data to support and enable computer science research[EB/OL]. [20150901]. http://institute.lanl.gov/data/lanldata.shtml.
[6]Wu M,Sun XH,Jin H. Performance under failures of highend computing[C]∥Proc of the 2007 ACM/IEEE Conference on Supercomputing,2007:111.
[7]Yang X,Liao X,Xu W,et al. TH1:Chinas first petaflop supercomputer[J]. Frontiers of Computer Science in China,2010,4(4):445455.
[8]Yang XJ,Liao XK,Lu K,et al. The TianHe1A supercomputer:Its hardware and software[J]. Journal of Computer Science and Technology,2011,26(3):344351.
[9]Borucki L,Schindlbeck G,Slayman C. Comparison of accelerated DRAM soft error rates measured at component and system level[C]∥Proc of the 2008 IEEE International Reliability Physics Symposium,2008:482487.
[10]Schroeder B,Pinheiro E,Weber WD. DRAM errors in the wild:A largescale field study[C]∥Proc of the 11th International Joint Conference on Measurement and Modeling of Computer Systems,2009:193204.
[11]Maruyama N,Nukada A,Matsuoka S. Softwarebased ECC for GPUS[C]∥Proc of the 2009 Symposium on Application Accelerators in High Performance Computing (SAAHPC’09),2009:1.
[12]Bronevetsky G,Marques D,Pingali K,et al. Automated applicationlevel checkpointing of MPI programs[C]∥Proc of the Symposium on Principles and Practice of Parallel Programming (PPoPP 2003),2003:8494.
[13]Wang Z Y,Yang X J,Zhou Y.Scalable triple modular redundancy fault tolerance mechanism for MPIoriented large scale parallel computing[J]. Journal of Software,2012,23(4):10221035. (in Chinese)
[14]Plank J S. A tutorial on reedsolomon coding for faulttolerance in RAIDlike systems[J]. SoftwarePractice & Experience,1997,27(9):9951012.
[15]Zaharia M,Chowdhury M,Das T,et al. Resilient distributed datasets:A faulttolerant abstraction for inmemory cluster computing [C]∥Proc of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association,2012:1.
[16]Li Xin,Guo Xiaowei,Lin Yufei. Stream eager transmission:The performance optimization technique for the distributed stream architecture[J]. Computer Engineering & Science,2015,37(11):20352044.(in Chinese)
[17]Song W. Research on fault tolerance for transactional memory system[D]. Changsha:National University of Defense Technology,2011. (in Chinese)
附中文参考文献:
[13]王之元,杨学军,周云. 大规模MPI并行计算的可扩展三模冗余容错机制[J]. 软件学报,2012,23(4):10221035.
[16]李鑫,郭晓威,林宇斐. 数据流Eager传输:一种分布式流体系结构中的性能优化技术[J]. 计算机工程与科学,2015,37(11):20352044.
[17]宋伟. 面向事务存储系统的容错技术研究[D]. 长沙:国防科学技术大学,2011. |