• Official journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

J4 ›› 2015, Vol. 37 ›› Issue (01): 1-6.

• Paper •

Methods to enhance reliability and serviceability of parallel
computing software on large scale clusters  

LIN Yanyu,CHEN Hu,MIAO Jun,HAN Jialongmei,LAI Lushuang   

  1. (School of Software, South China University of Technology, Guangzhou 510006, China)
  • Received: 2013-09-24 Revised: 2013-12-18 Online: 2015-01-25 Published: 2015-01-25

Abstract:

Parallel computing software on largescale clusters requires not only fault tolerance against local nodes or network failure,but also manageability,maintainability,portability and scalability. Based on the star model,we design a parallel computing framework and achieve systemwide fault tolerance, usability,portability and scalability,using methods such as the variable granularity decomposer and associated queue on the scheduling nodes.Our system can continuously run over 150 hours with 300 TFlops computational capability.Besides,the system is scalable.

 

Key words: availability; scalability; serviceability; large-scale cluster; parallel computing software