J4 ›› 2015, Vol. 37 ›› Issue (01): 1-6.
• Papers •
LIN Yanyu, CHEN Hu, MIAO Jun, HAN Jialongmei, LAI Lushuang
Abstract:
Parallel computing software on large-scale clusters requires not only fault tolerance against local node or network failures, but also manageability, maintainability, portability and scalability. Based on the star model, we design a parallel computing framework that achieves system-wide fault tolerance, usability, portability and scalability by using methods such as a variable-granularity decomposer and an associated queue on the scheduling nodes. Our system can run continuously for more than 150 hours with a computational capability of 300 TFlops, and it is scalable.
Key words: availability; scalability; serviceability; large-scale cluster; parallel computing software
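The abstract names a star model, a variable-granularity decomposer and a queue on the scheduling nodes, but gives no implementation details. The following is only a minimal sketch of that general idea, under assumptions of our own: a single scheduler (the hub of the star) hands chunks of work to workers, and when a worker fails its chunk is split into finer pieces and re-queued. The names `decompose` and `run_star`, the halving rule and the round-robin assignment are all hypothetical and are not taken from the paper.

```python
from collections import deque


def decompose(start, stop, granularity):
    """Split the half-open range [start, stop) into chunks of at most `granularity` items."""
    return [(lo, min(lo + granularity, stop)) for lo in range(start, stop, granularity)]


def run_star(total, granularity, workers):
    """Hypothetical hub scheduler: assign chunks round-robin and re-queue on failure.

    `workers` maps a worker name to a callable that returns a result for a
    chunk or raises RuntimeError to simulate a node or network failure.
    """
    pending = deque(decompose(0, total, granularity))  # queue held by the scheduler
    alive = list(workers)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no workers left to run the remaining chunks")
        chunk = pending.popleft()
        name = alive[turn % len(alive)]
        turn += 1
        try:
            results[chunk] = workers[name](chunk)
        except RuntimeError:
            # Assumed policy: drop the failed worker, then re-queue its chunk
            # at half the granularity so less work is lost on a later failure.
            alive.remove(name)
            lo, hi = chunk
            finer = max(1, (hi - lo) // 2)
            pending.extend(decompose(lo, hi, finer))
    return results, alive
```

In this sketch a failure costs only the chunk that was in flight, which is the usual motivation for keeping the pending work on the central node rather than on the workers.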
LIN Yanyu, CHEN Hu, MIAO Jun, HAN Jialongmei, LAI Lushuang. Methods to enhance reliability and serviceability of parallel computing software on large scale clusters [J]. J4, 2015, 37(01): 1-6.
URL: http://joces.nudt.edu.cn/EN/
http://joces.nudt.edu.cn/EN/Y2015/V37/I01/1