HPC cluster and low cost auto scaling model based on public cloud

Abstract

For many HPC users, computing cost is an important factor in deciding whether to move workloads to the public cloud. Alibaba Cloud provides the “preemptible instance”, an on-demand instance type intended to reduce the cost of using public cloud computing resources. The market price of a preemptible instance fluctuates and can be as low as 10% of the price of a “pay-as-you-go” instance. However, a preemptible instance cannot always be held for as long as the user requires: it may be released by the datacenter scheduler or for other reasons, so it is best suited to stateless scenarios. Based on users’ application types, job submission patterns, performance requirements, timing, and cost, we propose an auto scaling strategy on the public cloud for general HPC cluster schedulers, which automatically deploys computing resources and controls cost. HPC users pay only for what they want and what they use. By drawing on the public cloud's abundant resource types and rental models, and by combining the auto scaling service, preemptible instances, and application checkpoint/restart, we provide a low cost auto scaling model. When users submit jobs, they can set an expected cost; the auto scaling service then finds preemptible instances within this cost setting and uses the checkpoint/restart technique to keep the job running while computing resources are exchanged. Finally, we verify the feasibility and effectiveness of our solution with the LAMMPS and GROMACS applications.

Key words: high performance computing, public cloud, auto scaling, checkpoint/restart, low cost scaling model
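
The cost-capped instance selection and the checkpoint/restart behaviour described in the abstract can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `Offer` type, the instance names and prices, and the helpers `pick_offer` and `run_with_checkpoints` are not part of any Alibaba Cloud API and are not the authors' implementation. Preemption is simulated by a fixed set of step indices rather than by a real scheduler.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Offer:
    """A hypothetical preemptible-instance offer (names and prices are made up)."""
    instance_type: str
    price_per_hour: float  # current market price, arbitrary currency unit


def pick_offer(offers: Sequence[Offer], max_price: float) -> Optional[Offer]:
    """Return the cheapest offer at or below the user's price cap, or None."""
    affordable = [o for o in offers if o.price_per_hour <= max_price]
    return min(affordable, key=lambda o: o.price_per_hour, default=None)


def run_with_checkpoints(
    total_steps: int, checkpoint_every: int, preempt_at: Iterable[int]
) -> Tuple[int, int]:
    """Simulate a job whose instance is reclaimed at the given step indices.

    On each simulated preemption the job rolls back to its last checkpoint,
    as it would after restarting on a replacement instance.
    Returns (steps_executed_including_redone_work, number_of_restarts).
    """
    pending = set(preempt_at)  # each preemption fires at most once
    checkpoint = 0             # last step that was saved
    step = 0                   # current progress of the job
    executed = 0
    restarts = 0
    while step < total_steps:
        if step in pending:
            pending.discard(step)
            step = checkpoint  # work since the last checkpoint is lost
            restarts += 1
            continue
        step += 1
        executed += 1
        if step % checkpoint_every == 0:
            checkpoint = step
    return executed, restarts


if __name__ == "__main__":
    offers = [Offer("type-a", 0.30), Offer("type-b", 0.12), Offer("type-c", 0.50)]
    print(pick_offer(offers, max_price=0.35))
    print(run_with_checkpoints(total_steps=10, checkpoint_every=4, preempt_at={6}))
```

In this toy model, a shorter checkpoint interval reduces the work redone after each preemption, at the price of more frequent checkpoint writes, which the sketch does not account for.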

TIAN Yongjun,HE Wanqing,SUN Xiangzheng,YU Yang.

HPC cluster and low cost auto

scaling model based on public cloud

[J]. Computer Engineering & Science.

