• Official journal of the China Computer Federation (CCF)
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (08): 1423-1429.

• Articles

Reconfiguration approaches for fault-tolerant torus-connected processor arrays

ZHU Longting (祝龙婷)1, WU Jigang (武继刚)1, JIANG Guiyuan (姜桂圆)2, WANG Chao (王超)1

  1. School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China
  2. School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
  • Received: 2014-08-11 Revised: 2014-10-11 Online: 2015-08-25 Published: 2015-08-25
  • Supported by:

    National Natural Science Foundation of China (61173032); National Natural Science Foundation of China, Tianyuan Youth Fund (11326211)

Abstract:

Efficient fault-tolerance techniques are essential for improving the reliability of multiprocessor systems. The torus is an important interconnection network for multiprocessor arrays, yet no work has been reported on fault-tolerant reconfiguration of torus-connected processor arrays. Exploiting the torus's particular connection pattern, we model the reconfiguration of a torus-connected processor array as a maximum independent set problem on a contradiction graph. Each node of the contradiction graph represents a replacement alternative for a faulty processing element (PE), and each edge indicates that the two alternatives it joins cannot coexist. For three different distributions of redundant (spare) PEs, three algorithms are proposed: one to construct the contradiction graph, one to solve for a maximum independent set, and one to generate the logical array from the resulting independent set. Simulation results show that the one-row-one-column and cross distributions reconfigure well when the array is small or the fault density is low. The reconfiguration success rate of all three distributions decreases as the array size or fault density increases, but fault tolerance can be improved by integrating more spare PEs or by adjusting the spare distribution. The results also show that torus arrays outperform mesh arrays in fault tolerance.

Key words: torus-connected processor array; reconfiguration algorithm; fault tolerance; contradiction graph
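The contradiction-graph formulation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm. The abstract does not state the exact conflict rules, so the sketch assumes that two replacement alternatives conflict when they serve the same faulty PE or claim the same spare PE. The names `build_conflict_graph`, `max_independent_set`, and `can_reconfigure`, and the spare labels `"S0"` and `"S1"`, are hypothetical. The exact solver is exponential in the worst case and is only meant for small graphs.

```python
from itertools import combinations


def build_conflict_graph(alternatives):
    """Build an adjacency map over replacement alternatives.

    alternatives: list of (faulty_pe, spare_pe) pairs; node i is alternatives[i].
    Assumed conflict rule (not taken from the paper): two alternatives cannot
    coexist if they repair the same faulty PE or use the same spare PE.
    """
    adj = {i: set() for i in range(len(alternatives))}
    for i, j in combinations(range(len(alternatives)), 2):
        fault_i, spare_i = alternatives[i]
        fault_j, spare_j = alternatives[j]
        if fault_i == fault_j or spare_i == spare_j:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def max_independent_set(adj, candidates=None):
    """Exact maximum independent set by branching on one vertex at a time."""
    if candidates is None:
        candidates = set(adj)
    if not candidates:
        return set()
    # Branch on the vertex with the most neighbours among the remaining candidates.
    v = max(candidates, key=lambda u: len(adj[u] & candidates))
    # Case 1: v is left out of the independent set.
    without_v = max_independent_set(adj, candidates - {v})
    # Case 2: v is taken, so none of its neighbours can be taken.
    with_v = {v} | max_independent_set(adj, candidates - {v} - adj[v])
    return with_v if len(with_v) >= len(without_v) else without_v


def can_reconfigure(faulty_pes, alternatives):
    """Under the assumed rules, every faulty PE needs one selected alternative."""
    chosen = max_independent_set(build_conflict_graph(alternatives))
    return len(chosen) == len(set(faulty_pes)), chosen


# Two faulty PEs: (0, 1) may use spare S0 or S1, while (2, 2) may only use S1.
faulty = [(0, 1), (2, 2)]
alts = [((0, 1), "S0"), ((0, 1), "S1"), ((2, 2), "S1")]
ok, chosen = can_reconfigure(faulty, alts)
```

In this toy instance the selected alternatives are nodes 0 and 2, so both faults can be repaired. Under the simplified conflict rule the problem reduces to bipartite matching; the independent-set view matters once further conflicts, such as overlapping replacement paths, are added as extra edges.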