A triple modular eager redundancy fault-tolerant technique for distributed stream architecture

J4, 2015, Vol. 37, Issue (12): 2233-2241.

LI Xin1,3,4, LIN Yufei2, GUO Xiaowei1

(1. The State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073;
2. Graduate School, National University of Defense Technology, Changsha 410073;
3. PLA University of Science and Technology, Nanjing 210007;
4. The 63rd Research Institute of PLA General Staff Headquarters, Nanjing 210007, China)

Received: 2015-09-08 Revised: 2015-11-26 Online: 2015-12-25 Published: 2015-12-25
Funding: National Natural Science Foundation of China (61221491, 61303071)

Abstract

As computing systems in the Internet environment continue to grow in scale, the reliability of the distributed stream architecture faces serious challenges. Building on the N-modular redundancy technique, we propose TREFT, a triple modular eager redundancy fault-tolerant technique for the distributed stream architecture that targets soft errors. TREFT uses three program replicas to detect and correct errors efficiently. Experimental results on a prototype system of the distributed stream architecture show that TREFT effectively improves system reliability at low cost, adding an average fault-tolerance overhead of 10.77%.

Key words: distributed stream architecture; fault-tolerant technique; triple modular redundancy
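
As a rough illustration of the general triple modular redundancy idea the abstract builds on, the sketch below runs a task three times and majority-votes the results: a 2-of-3 agreement both detects and masks a single faulty replica. This is not the TREFT implementation, whose details are not given on this page. The names `tmr_vote` and `run_replicated`, the sequential execution of replicas, and the injected bit flip are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def tmr_vote(outputs: Tuple[int, int, int]) -> Tuple[int, bool]:
    """Majority-vote over three replica outputs.

    Returns (voted_value, error_detected).
    Raises RuntimeError when all three outputs differ (no majority).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, False   # all replicas agree: no error observed
    if count == 2:
        return value, True    # one replica disagrees: error detected and masked
    raise RuntimeError("no majority: all three replicas disagree")


def run_replicated(task: Callable[[int], int], arg: int,
                   faulty_replica: Optional[int] = None) -> Tuple[int, bool]:
    """Run `task` three times and vote on the results.

    `faulty_replica` is a hypothetical test hook: it flips the lowest bit of
    that replica's result to imitate a soft error.
    """
    outputs = []
    for i in range(3):
        result = task(arg)
        if i == faulty_replica:
            result ^= 1       # injected single-bit flip (illustration only)
        outputs.append(result)
    return tmr_vote((outputs[0], outputs[1], outputs[2]))
```

For example, `run_replicated(lambda x: x * 2, 21, faulty_replica=1)` votes over the outputs 42, 43, 42, returning the correct value 42 together with a flag showing that an error was detected. A real system would run the replicas on separate nodes or threads, which this sketch does not attempt.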

LI Xin, LIN Yufei, GUO Xiaowei. A triple modular eager redundancy fault-tolerant technique for distributed stream architecture [J]. J4, 2015, 37(12): 2233-2241.
