• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (01): 1-6.

• Articles •

Methods to enhance reliability and serviceability of parallel computing software on large scale clusters

LIN Yanyu (林彦宇), CHEN Hu (陈虎), MIAO Jun (苗军), HAN Jialongmei (韩佳龙媚), LAI Lushuang (赖路双)

  1. (School of Software, South China University of Technology, Guangzhou, Guangdong 510006, China)
  • Received: 2013-09-24 Revised: 2013-12-18 Online: 2015-01-25 Published: 2015-01-25

Abstract:

Parallel computing software on large-scale clusters must tolerate the failure of some nodes, network links, and other components, and it must also be easy to manage, maintain, port, and scale. Targeting the star computing model, we designed and developed a parallel computing framework. Using mechanisms inside the scheduling node, such as a variable-granularity decomposer and associated queues, the framework achieves system-wide fault tolerance together with good usability, portability, and scalability. The system currently runs continuously for more than 150 hours at a computing capability of 300 TFlops, and it can be scaled further.

 

Key words: reliability; scalability; serviceability; large-scale cluster; parallel computing software