
Computer Engineering & Science ›› 2024, Vol. 46 ›› Issue (06): 1081-1091.

• Artificial Intelligence and Data Mining •


A population diversity-based robust policy generation method in adversarial game environments

ZHUANG Shu-xin1, CHEN Yong-hong2, HAO Yi-hang2, WU Wei-wei1, XU Xue-yong3, WANG Wan-yuan1

  (1. School of Computer Science and Engineering, Southeast University, Nanjing 211189;
    2. Shenyang Aircraft Design and Research Institute, Yangzhou Collaborative Innovation Research Institute Co., Ltd., Yangzhou 210016;
    3. Nanjing North Information Industrialization Group Co., Ltd., Nanjing 211189, China)
  • Received: 2023-10-12  Revised: 2023-12-05  Accepted: 2024-06-25  Online: 2024-06-25  Published: 2024-06-18


Abstract: In adversarial game environments, the target agent aims to generate robust game policies that yield consistently high returns against different opponent policies. Existing self-play-based policy generation methods often overfit to learning against one specific opponent policy, so the learned policy has low robustness and is vulnerable to attacks from other opponent policies. In addition, existing methods that combine deep reinforcement learning with game theory to iteratively generate opponent policies converge slowly in complex adversarial scenarios with large decision spaces. To address these challenges, a population diversity-based robust policy generation method is proposed, in which each of the two opposing sides maintains its own population pool of policies and the policies within each population are kept diverse, so that a robust target policy can be generated. To ensure population diversity, policy diversity is measured from two perspectives: behavioral diversity and quality diversity. Behavioral diversity refers to the differences between the state-action trajectories of different policies, while quality diversity refers to the differences in the returns that different policies obtain against the same opponent. Finally, the robustness of the policies generated with population diversity is validated in typical adversarial environments with continuous state and action spaces.
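
The abstract defines the two diversity measures only in words. As a rough illustration, the following Python sketch scores a population's behavioral diversity as the average pairwise distance between per-policy trajectory summaries, and its quality diversity as the spread of returns obtained against a shared set of opponents; the trajectory summary, the Euclidean distance, and the per-opponent standard deviation are illustrative assumptions, not the measures actually used in the paper.

```python
import numpy as np


def behavioral_diversity(trajectories):
    """Average pairwise distance between state-action trajectories.

    `trajectories` holds one array of shape (T, state_dim + action_dim) per
    policy in the population. Summarizing each trajectory by its mean feature
    vector and comparing summaries with Euclidean distance is only an
    illustrative choice.
    """
    summaries = [np.asarray(t, dtype=float).mean(axis=0) for t in trajectories]
    n = len(summaries)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(summaries[i] - summaries[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))


def quality_diversity(returns_vs_opponents):
    """Spread of returns the population obtains against the same opponents.

    `returns_vs_opponents` has shape (n_policies, n_opponents): entry (i, j)
    is the return of population policy i against opponent policy j. Here the
    spread is the mean per-opponent standard deviation of returns.
    """
    payoff = np.asarray(returns_vs_opponents, dtype=float)
    return float(payoff.std(axis=0).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three population policies, each represented by a 50-step trajectory of
    # 6-dimensional state-action features (synthetic data for illustration).
    trajs = [rng.normal(loc=i, size=(50, 6)) for i in range(3)]
    # Returns of the three policies against four fixed opponent policies.
    payoffs = rng.normal(size=(3, 4))
    print("behavioral diversity:", behavioral_diversity(trajs))
    print("quality diversity:", quality_diversity(payoffs))
```

In the paper's setting, scores of this kind would be computed over each side's population pool and used to keep the pool diverse while the target policy is trained against it.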


Key words: adversarial environment, deep reinforcement learning, population diversity, Shapley value, behavior representation