• Journal of the China Computer Federation
  • China Science and Technology Core Journal
  • Chinese Core Journal

Computer Engineering & Science ›› 2023, Vol. 45 ›› Issue (04): 718-725.

• Artificial Intelligence and Data Mining •

A strategy search method based on particle swarm optimization and deep reinforcement learning

PENG Kun-yan, YIN Xiang, LIU Xiao-zhu, LI Heng-yu

  1. (School of Information Engineering (School of Artificial Intelligence), Yangzhou University, Yangzhou 225117, China)
  • Received: 2021-07-08 Revised: 2021-11-15 Accepted: 2023-04-25 Online: 2023-04-25 Published: 2023-04-13
  • Supported by:
    Natural Science Foundation of Jiangsu Province (BK20190878)

Abstract: Deep reinforcement learning (DRL) is a widely used policy search method that has been applied successfully to a series of challenging control tasks. However, DRL copes poorly with sparse rewards, lacks effective exploration, and converges fragilely because it is extremely sensitive to hyperparameters, all of which make it hard to apply to large-scale practical problems. Particle swarm optimization (PSO) is an evolutionary optimization algorithm that uses the cumulative return of an entire episode as the fitness value, so it is insensitive to sparse-reward environments. It also offers diversified, population-based exploration and stable convergence, but its sample efficiency is low. This paper therefore proposes PSO-RL, an algorithm that combines PSO with an off-policy DRL algorithm based on policy gradients. DRL uses the diverse data supplied by the PSO population to train the few policies in the population with the lowest cumulative rewards, and each time the policies whose cumulative rewards improved after training are inserted back into the PSO population, which strengthens the exchange of information between DRL and PSO. PSO-RL improves the sample efficiency of PSO and also improves the performance and stability of the DRL algorithm. Experiments on challenging continuous control tasks in the PyBullet module show that PSO-RL outperforms both DRL and evolutionary reinforcement learning algorithms.

Keywords: particle swarm optimization, strategy search, deep reinforcement learning, policy gradient, reinforcement learning
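
The control flow described in the abstract (a PSO population scored by episode return, with the lowest-scoring members handed to a learner and re-inserted only if they improve) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pso_rl_sketch`, the hyperparameter values, and the callables `fitness` and `refine` are all hypothetical. `fitness` stands in for the cumulative return of one episode, and `refine` stands in for the off-policy policy-gradient training step. The shared experience data that the PSO population supplies to DRL is omitted entirely.

```python
import random


def pso_rl_sketch(fitness, refine, dim, n_particles=10, n_worst=2,
                  iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Hypothetical PSO loop with a pluggable learner for the worst particles.

    fitness(theta) -> float  stands in for the cumulative return of an episode.
    refine(theta)  -> theta  stands in for a DRL training step (assumption).
    """
    rng = random.Random(seed)
    # Each particle is a flat vector of policy parameters.
    pos = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    fit = [fitness(p) for p in pos]
    pbest = [p[:] for p in pos]
    pbest_fit = fit[:]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(iters):
        # Standard PSO velocity and position update.
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit[i] = fitness(pos[i])

        # Hand the lowest-return particles to the learner and
        # re-insert a refined particle only if its return improved.
        worst = sorted(range(n_particles), key=lambda i: fit[i])[:n_worst]
        for i in worst:
            cand = refine(pos[i])
            cand_fit = fitness(cand)
            if cand_fit > fit[i]:
                pos[i], fit[i] = list(cand), cand_fit

        # Update personal bests and the global best.
        for i in range(n_particles):
            if fit[i] > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit[i]
                if fit[i] > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit[i]

    return gbest, gbest_fit
```

For a quick check, `fitness` can be any function to maximize, such as the negative squared distance to a target vector, and `refine` can be a single gradient-ascent step on that same function. In a real setting `fitness` would roll out a policy in an environment and `refine` would run gradient updates from a replay buffer.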