• Journal of the China Computer Federation
  • Chinese Science and Technology Core Journal
  • Chinese Core Journal

计算机工程与科学 (Computer Engineering & Science)

• High Performance Computing

Implementation of an HPC cluster on the public cloud and a study of auto-scaling idle-time computing

田永军,何万青,孙相征,余洋   

  1. (Alibaba Cloud Computing Co., Ltd., Hangzhou, Zhejiang 310024)
  • Received: 2018-10-17 Revised: 2018-12-21 Online: 2019-07-25 Published: 2019-07-25

HPC cluster and low cost auto scaling model based on public cloud

TIAN Yongjun,HE Wanqing,SUN Xiangzheng,YU Yang   

  1. (Alibaba Cloud Computing Co., Ltd., Hangzhou 310024, China)
  • Received: 2018-10-17 Revised: 2018-12-21 Online: 2019-07-25 Published: 2019-07-25

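Abstract (Chinese original, translated):

For HPC users, computing cost is one of the key factors in deciding whether to migrate to the cloud. Alibaba Cloud offers preemptible instances, a type of on-demand instance intended to reduce the cost of using public cloud computing resources. The market price of a preemptible instance fluctuates and is usually far below that of a regular on-demand instance, at times as low as one tenth of it. A preemptible instance is generally reserved for the user for a minimum period after creation, after which it may be released, so it is generally suited to stateless workloads. This paper proposes an auto-scaling strategy on the public cloud that targets general-purpose HPC cluster schedulers. Based on the user's application type, job submission patterns, and performance and cost requirements, it automatically deploys and scales out computing resources in the cloud while controlling cost, so that users "only pay for what you want and what you use". Drawing on the wide range of instance types and purchasing models in the public cloud, and combining the auto-scaling service, preemptible instances, and checkpoint/restart, a low-cost HPC auto-scaling solution can be configured: when submitting a job, the user can specify a cost ceiling; the auto-scaling service then finds and provisions preemptible computing resources priced below that ceiling, and checkpoint/restart ensures that the job can continue computing when the underlying resources are switched. Finally, the feasibility and effectiveness of the strategy are verified with two HPC applications, LAMMPS and GROMACS.

Keywords: high performance computing, public cloud, auto scaling, checkpoint/restart, idle-time computing scaling model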

Abstract:

For many HPC users, computing cost is an important factor in deciding whether to move workloads to the public cloud. Alibaba Cloud provides the “preemptible instance”, a type of on-demand instance designed to reduce the cost of using public cloud computing resources. The market price of a preemptible instance fluctuates and can be as low as 10% of the price of a pay-as-you-go instance. A preemptible instance is not guaranteed for as long as the user needs it and may be released by the datacenter scheduler or for other reasons, so it is mainly suited to stateless scenarios. We propose an auto-scaling strategy on the public cloud for general HPC cluster schedulers. Based on users’ application types, job submission patterns, and performance, timing, and cost requirements, it automatically deploys computing resources and controls cost, so that HPC users pay only for what they want and what they use. Taking advantage of the abundant resource types and purchasing models of the public cloud, together with the auto-scaling service, preemptible instances, and application checkpoint/restart, we provide a low-cost auto-scaling model. When submitting a job, users can set a cost ceiling; the auto-scaling service then finds preemptible instances priced under that ceiling and uses checkpoint/restart to keep the job running when the computing resources change. Finally, we verify the feasibility and effectiveness of our solution with the LAMMPS and GROMACS applications.

Key words: high performance computing, public cloud, auto scaling, checkpoint/restart, low cost scaling model
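
The two mechanisms named in the abstract, choosing preemptible resources under a user-set cost ceiling and resuming a job from its last checkpoint after an instance is released, can be illustrated with a minimal Python sketch. It is not the authors' implementation and calls no Alibaba Cloud API: `SpotOffer`, `pick_offer`, and `run_with_checkpoints` are hypothetical names, the prices are invented, and preemption is simulated by a list of step numbers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SpotOffer:
    """Hypothetical preemptible-instance offer; fields are illustrative only."""
    instance_type: str
    cores: int
    price_per_hour: float  # current market price for the whole instance


def pick_offer(offers: List[SpotOffer],
               max_price_per_core_hour: float) -> Optional[SpotOffer]:
    """Return the cheapest offer per core-hour under the user's cost ceiling.

    Returns None when no offer is cheap enough, in which case a scaler
    would wait rather than exceed the ceiling.
    """
    eligible = [o for o in offers
                if o.price_per_hour / o.cores <= max_price_per_core_hour]
    if not eligible:
        return None
    return min(eligible, key=lambda o: o.price_per_hour / o.cores)


def run_with_checkpoints(total_steps: int,
                         checkpoint_every: int,
                         preempt_at: List[int]) -> Tuple[int, int]:
    """Simulate a job that restarts from its last checkpoint when preempted.

    `preempt_at` lists the steps during which the (simulated) instance is
    released; the work done since the last checkpoint is lost and redone.
    Returns (steps_executed_in_total, number_of_restarts).
    """
    pending = sorted(preempt_at)
    checkpoint = 0   # last step whose state was saved
    step = 0         # current progress of the job
    executed = 0     # every step actually computed, including repeats
    restarts = 0
    while step < total_steps:
        step += 1
        executed += 1
        if pending and step == pending[0]:
            # Instance released: roll back to the saved state and resume.
            pending.pop(0)
            step = checkpoint
            restarts += 1
            continue
        if step % checkpoint_every == 0:
            checkpoint = step
    return executed, restarts
```

For example, with a checkpoint every 10 steps and simulated preemptions at steps 25 and 57, a 100-step job executes 112 steps in total, because 5 and then 7 steps have to be repeated. The checkpoint interval therefore trades checkpointing overhead against the amount of recomputation after each release.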