• A journal of the China Computer Federation
  • Chinese core journal of science and technology
  • Chinese core journal

Computer Engineering & Science



Offline reinforcement learning with guided exploration

Zhou Xianwei, Luo Shixin, Wang Yuxiang, Yu Songsen   

  1. (School of Artificial Intelligence, South China Normal University, Foshan 528225, China)

  • Online: 2025-12-10 Published: 2025-12-10


Abstract: Offline reinforcement learning aims to enable agents to learn policies from historical data without online interaction, thereby reducing costs and avoiding risks in real-world scenarios. However, due to the lack of environmental feedback, policies learned from offline datasets suffer from data distribution shift. Most existing methods are based on conservatism: by confining policy learning to the distribution of the offline dataset, they alleviate distribution shift to some extent, but also restrict the agent's ability to explore. To address this issue, this paper proposes a method based on guided exploration. The method uses a guiding state network to generate neighboring states with high potential value, steering the agent to explore out-of-distribution states. Meanwhile, a behavior cloning term is introduced to dynamically adjust the gap between the behavior policy and the learned policy, ensuring a stable learning process. Experimental results on the D4RL benchmark show that the proposed algorithm outperforms existing mainstream algorithms on multiple tasks.
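The two mechanisms named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's actual formulation: the guiding state network is stood in for by random candidate perturbations ranked by a value estimate, and the dynamically weighted behavior cloning term follows a TD3+BC-style adaptive scaling.

```python
import numpy as np

def guided_neighbor_state(state, value_fn, n_candidates=16, radius=0.1, rng=None):
    """Pick a high-potential-value neighboring state.

    Samples small perturbations around the current state and keeps the
    candidate with the highest estimated value -- a hypothetical stand-in
    for a learned guiding state network.
    """
    rng = np.random.default_rng() if rng is None else rng
    candidates = state + radius * rng.standard_normal((n_candidates, state.shape[-1]))
    values = np.array([value_fn(c) for c in candidates])
    return candidates[np.argmax(values)]

def bc_regularized_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Actor objective combining Q-maximization with a behavior cloning penalty.

    The adaptive weight shrinks the Q term when Q estimates are large, so the
    behavior cloning term keeps the learned policy close to the behavior policy
    (a TD3+BC-style sketch; the paper's exact weighting may differ).

    q_values:        Q(s, pi(s)) for a batch of states, shape (B,)
    policy_actions:  actions proposed by the learned policy, shape (B, A)
    dataset_actions: actions from the offline dataset, shape (B, A)
    """
    lam = alpha / (np.mean(np.abs(q_values)) + 1e-8)
    q_term = -lam * np.mean(q_values)                           # maximize value
    bc_term = np.mean((policy_actions - dataset_actions) ** 2)  # stay near data
    return q_term + bc_term
```

Minimizing `bc_regularized_actor_loss` pushes the policy toward high-value actions while the squared-error term penalizes deviation from the dataset actions; the weight `lam` is what lets the trade-off adjust dynamically as Q-value magnitudes change during training.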


Key words: offline reinforcement learning, distribution shift, guided exploration