
Computer Engineering & Science, 2025, Vol. 47, Issue 12: 2129-2138.

• High Performance Computing •

OpenLM: A multi-platform and high-performance large language model inference framework

LIU Gao, XU Jianliang, ZHANG Xianyi, LIU Xiandong

  (1. Faculty of Information Science and Engineering, Ocean University of China, Qingdao 266100, China;
   2. Peng Feng (Beijing) Technology Co., Ltd., Beijing 100080, China)
  • Received: 2025-02-20  Revised: 2025-03-04  Online: 2025-12-25  Published: 2026-01-06

Abstract: As computing devices continue to diversify and computational power grows rapidly, the increasing number of large language models (LLMs) has made efficient multi-model inference across heterogeneous platforms a complex and formidable challenge. To address this, we propose OpenLM, a high-performance inference framework that supports efficient deployment of multiple LLMs on diverse hardware platforms. OpenLM offers broad model compatibility, providing efficient performance support for a wide range of models, and incorporates high-performance computing operators optimized for multiple platforms and architectures to fully exploit the underlying hardware. Its flexible architecture also facilitates rapid integration of the latest models. To further optimize memory consumption (both GPU and CPU), task scheduling, and system stability during inference, the framework introduces PagedAttention, dynamic batching, weight quantization, and KV cache quantization. Experimental results show that these optimization strategies effectively improve inference efficiency, reduce resource overhead, and enhance overall framework performance.
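The abstract names PagedAttention and KV cache quantization among OpenLM's memory optimizations. As a rough illustration of how these two ideas combine, the sketch below (plain NumPy; BlockAllocator, PagedKVCache, the block sizes, and the int8 scheme are all assumptions, not OpenLM's actual API) maps each sequence to a table of fixed-size cache blocks and stores key vectors in quantized form, so cache memory grows block by block instead of being reserved for the maximum sequence length up front.

```python
import numpy as np

BLOCK_SIZE = 16      # tokens stored per cache block (illustrative)
NUM_BLOCKS = 256     # total blocks in the cache pool (illustrative)
HEAD_DIM = 64        # per-head hidden size (illustrative)

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks on demand, so a growing
    sequence never needs one large contiguous allocation."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

def quantize_int8(x):
    """Symmetric per-vector int8 quantization; returns codes and scale."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-8)
    return np.round(x / scale).astype(np.int8), scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

class PagedKVCache:
    """Maps each sequence to a list of block IDs (its block table) and
    stores quantized key vectors inside those blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.keys = np.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM), dtype=np.int8)
        self.scales = np.ones((NUM_BLOCKS, BLOCK_SIZE), dtype=np.float32)
        self.block_table = {}  # seq_id -> [block_id, ...]

    def append(self, seq_id, pos, key_vec):
        blocks = self.block_table.setdefault(seq_id, [])
        block_idx, offset = divmod(pos, BLOCK_SIZE)
        while len(blocks) <= block_idx:      # crossed a block boundary
            blocks.append(self.allocator.allocate())
        bid = blocks[block_idx]
        self.keys[bid, offset], self.scales[bid, offset] = quantize_int8(key_vec)

    def read(self, seq_id, pos):
        bid = self.block_table[seq_id][pos // BLOCK_SIZE]
        off = pos % BLOCK_SIZE
        return dequantize_int8(self.keys[bid, off], self.scales[bid, off])

# Usage: 40 generated tokens for one sequence occupy 3 blocks of 16 slots.
cache = PagedKVCache(BlockAllocator(NUM_BLOCKS))
for t in range(40):
    cache.append(seq_id=0, pos=t, key_vec=np.random.randn(HEAD_DIM))
print(cache.block_table[0])     # three block IDs drawn from the pool
print(cache.read(0, 39).shape)  # (64,)
```

Production frameworks keep these blocks in device memory and fuse the dequantization into the attention kernel itself; the NumPy version above only shows the block-table bookkeeping that the abstract's memory optimizations rely on.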


Key words: deep learning, large language model (LLM), high-performance computing (HPC), LLM inference framework