
Computer Engineering & Science, 2025, Vol. 47, Issue (12): 2129-2138.

• High Performance Computing •


OpenLM: A multi-platform and high-performance large language model inference framework

LIU Gao, XU Jianliang, ZHANG Xianyi, LIU Xiandong

  (1. Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, China;
    2. Peng Feng (Beijing) Technology Co., Ltd., Beijing 100080, China)
  • Received: 2025-02-20  Revised: 2025-03-04  Online: 2025-12-25  Published: 2026-01-06


Abstract: As computational devices continue to diversify and computational power grows rapidly, the increasing number of large language models (LLMs) has made efficient multi-model inference across heterogeneous platforms a complex and formidable challenge. To address this, we propose OpenLM, a high-performance inference framework designed to support efficient deployment of multiple LLMs on diverse hardware platforms. OpenLM offers extensive model compatibility, providing efficient inference support for a wide range of models, and incorporates high-performance computing operators optimized for multiple platforms and architectures to maximize hardware performance. Its flexible architecture also facilitates rapid integration and support for the latest models. To further optimize memory consumption (both GPU and CPU memory), task scheduling, and system stability during inference, the framework introduces features such as the PagedAttention mechanism, dynamic batching, weight quantization, and KV cache quantization. Experimental results show that these optimization strategies effectively improve inference efficiency, reduce resource overhead, and enhance overall framework performance.


Key words: deep learning, large language model (LLM), high-performance computing (HPC), LLM inference framework
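
The abstract mentions PagedAttention and KV cache quantization among OpenLM's optimizations. The paper page contains no code; the following is a minimal, hypothetical Python sketch (not OpenLM's actual implementation) illustrating the general ideas behind these two techniques: a block table that maps each sequence's token positions to fixed-size physical cache blocks, combined with per-token symmetric INT8 quantization of stored key vectors. All names, block sizes, and dimensions below are assumptions made for illustration only.

```python
# Hypothetical sketch, NOT OpenLM's actual code: a paged KV cache with
# per-token INT8 quantization. It illustrates two ideas named in the
# abstract -- PagedAttention-style block tables and KV cache quantization.
import numpy as np

BLOCK_SIZE = 16   # tokens per physical block (assumed value)
HEAD_DIM = 64     # per-head hidden size (assumed value)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        # Physical storage: INT8 key values plus one FP32 scale per stored token.
        self.k_data = np.zeros((num_blocks, BLOCK_SIZE, HEAD_DIM), dtype=np.int8)
        self.k_scale = np.ones((num_blocks, BLOCK_SIZE), dtype=np.float32)
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}  # sequence id -> list of physical block ids

    def append(self, seq_id: int, pos: int, k_vec: np.ndarray) -> None:
        """Quantize and store the key vector of token `pos` in sequence `seq_id`."""
        blocks = self.block_table.setdefault(seq_id, [])
        if pos // BLOCK_SIZE == len(blocks):        # logical block not mapped yet
            blocks.append(self.free_blocks.pop())   # allocate a physical block
        block, slot = blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE
        # Symmetric INT8 quantization: x_q = round(x / s), s = max|x| / 127.
        scale = max(float(np.abs(k_vec).max()) / 127.0, 1e-8)
        self.k_data[block, slot] = np.round(k_vec / scale).astype(np.int8)
        self.k_scale[block, slot] = scale

    def read(self, seq_id: int, pos: int) -> np.ndarray:
        """Dequantize and return the stored key vector."""
        block, slot = self.block_table[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE
        return self.k_data[block, slot].astype(np.float32) * self.k_scale[block, slot]

# Usage: store a random key vector and read back a close approximation.
cache = PagedKVCache(num_blocks=8)
k = np.random.randn(HEAD_DIM).astype(np.float32)
cache.append(seq_id=0, pos=0, k_vec=k)
print(np.allclose(cache.read(0, 0), k, atol=0.1))   # True
```

In a real inference framework, block-level paging of this kind lets the scheduler allocate KV cache memory on demand rather than reserving the maximum sequence length per request, which in turn makes dynamic batching of many concurrent sequences practical, while INT8 storage roughly halves KV cache memory relative to FP16 at a small accuracy cost.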