• 中国计算机学会会刊
  • 中国科技核心期刊
  • 中文核心期刊

计算机工程与科学 ›› 2025, Vol. 47 ›› Issue (01): 140-149.

• 人工智能与数据挖掘 • 上一篇    下一篇

融合强化学习的分阶段策略求解旅行背包问题

章政1,2,夏小云2,陈泽丰3,向毅4   

  1. (1.浙江理工大学计算机科学与技术学院,浙江 杭州 310018;2.嘉兴大学人工智能学院,浙江 嘉兴 314001;
    3.中山大学人工智能学院,广东 珠海 519082;4.华南理工大学软件学院,广东 广州 510006)
  • 收稿日期:2023-08-23 修回日期:2024-04-23 接受日期:2025-01-25 出版日期:2025-01-25 发布日期:2025-01-18
  • 基金资助:
    国家重点研发计划(2023YFC3305900,2023YFC3305903);国家自然科学基金(62206313,61703183);中央高校基本科研业务费专项资金(2024ZYGXZR097);广东省基础与应用基础研究基金(2024A1515030022);浙江省自然科学基金(LGG19F030010);嘉兴大学“勤慎”青年学者培养计划(嘉院人字〔2023〕12号)

A staged strategy incorporating reinforcement learning to solve the travelling thief problem

ZHANG Zheng1,2,XIA Xiaoyun2,CHEN Zefeng3,XIANG Yi4   

  1. (1.School of Computer Science and Technology,Zhejiang Sci-Tech University,Hangzhou 310018;
    2.School of Artificial Intelligence,Jiaxing University,Jiaxing 314001;
    3.School of Artificial Intelligence,Sun Yat-sen University,Zhuhai 519082;
    4.School of Software Engineering,South China University of Technology,Guangzhou 510006,China)n
  • Received:2023-08-23 Revised:2024-04-23 Accepted:2025-01-25 Online:2025-01-25 Published:2025-01-18

摘要: 旅行背包问题TTP是传统的旅行商问题和背包问题的结合,属于NP难问题。相较于独立的旅行商问题和背包问题,旅行背包问题更加符合现实情况,具有更高的研究价值。先前的TTP求解算法主要为启发式算法,性能有限,其他类型的算法则研究较少。为了提高TTP的求解性能,提出了融合强化学习的算法,采用分阶段策略。第1阶段根据物品的属性生成物品选择计划,第2阶段利用强化学习演员-评论家(Actor-Critic)算法求解旅行路径,第3阶段引入邻域搜索策略优化所得解。实验结果表明,所提算法在大部分算例上都取得了较好的结果,并且在部分算例上,解的质量超越了其他对比算法,表明了所提算法具有较优的性能。

关键词: 强化学习, 旅行背包问题, 演员-评论家算法, 组合优化

Abstract: The travelling thief problem (TTP) is a combination of the traditional traveling salesman problem(TSP) and the knapsack problem(KP), which is an NP-hard problem. Compared with the independent TSP and KP, the TTP is more realistic and has higher research value. Previous TTP solving algorithms are mainly heuristic algorithms with limited performance, and other types of algorithms are less studied. To acquire better solution for TTP, a staged strategy of incorporating reinforcement learning is proposed. The first stage generates an item selection plan based on the properties of items. The second stage uses a reinforcement learning algorithm (Actor-Critic algorithm) to solve the travel path. The third stage introduces neighborhood search strategy to optimize the obtained solution. Experiments show that the proposed algorithm achieves good results on most test cases and, in some cases, outperforms the compared algorithms in terms of solution quality, demonstrating the superior performance of the proposed algorithm. 

Key words: reinforcement learning, travelling thief problem (TTP), Actor-Critic algorithm, combinatorial optimizatio