Research and Design of a Decision-Making System for Soccer Robots Based on the SARSA Algorithm
Published: 2018-12-08 21:20
[Abstract]: The RoboCup 2D simulation soccer platform is a testbed for research on multi-agent robot systems, on which researchers can evaluate different machine learning algorithms. Reinforcement learning is one of the important families of machine learning algorithms: it lets an agent maximize its cumulative reward through continual interaction with the environment, and under certain conditions the learning is guaranteed to converge to the optimal policy. Reinforcement learning has been applied successfully to games such as Go, Gomoku, Tetris, and Unreal Tournament, but it has not been studied thoroughly in the RoboCup 2D simulation league. This thesis introduces the SARSA algorithm into the RoboCup 2D simulation league and improves it. The state space of the player agent is mapped according to the positions of the defenders and of the ball; from this mapping, a corresponding precondition function is derived and used as the basis for SARSA's action selection, and the algorithm is designed and implemented within the Helios framework. Drawing on soccer domain knowledge, the thesis proposes two reward-correction functions, one based on team dispersion and one based on ball transfer distance, to improve the team's performance. In a multi-agent system, the Q-table obtained by a single agent learning independently is often sparse and cannot represent the global situation of the whole system; to address this, the thesis studies sharing Q-tables among agents and proposes a multi-Q-table fusion algorithm that gives the team a higher winning rate in matches. Because the design of a reinforcement learning algorithm must guarantee convergence of the Q-table, the thesis first compares the convergence of an adaptive ε-greedy action selection strategy with that of a fixed ε-greedy strategy and adopts the adaptive strategy, which converges. For the design of the reward function, it compares the effect of different reward values on goals scored and determines an appropriate reward value, then compares the team's winning rate after introducing the two reward corrections into SARSA; experiments show that the corrections improve the winning rate. Finally, multiple matches were played against teams that have participated in RoboCup 2D, and statistical analysis of the results verifies the effectiveness of the proposed algorithms.
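The abstract does not reproduce the algorithm itself. As a reference point, below is a minimal sketch of tabular SARSA with an adaptive ε-greedy policy of the kind the thesis compares against a fixed-ε variant. All names and hyperparameters (alpha, gamma, the decay schedule) are illustrative assumptions, not values taken from the thesis.

```python
# Minimal tabular SARSA with an adaptive epsilon-greedy policy (a sketch;
# hyperparameters and state/action encodings are illustrative assumptions).
import random
from collections import defaultdict

class SarsaAgent:
    def __init__(self, n_actions, alpha=0.1, gamma=0.9,
                 eps_start=1.0, eps_min=0.05, eps_decay=0.999):
        self.q = defaultdict(float)          # Q-table: (state, action) -> value
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.eps = eps_start                  # adaptive epsilon, decayed per step
        self.eps_min, self.eps_decay = eps_min, eps_decay

    def act(self, state):
        # Epsilon-greedy: explore with probability eps, otherwise exploit.
        if random.random() < self.eps:
            action = random.randrange(self.n_actions)
        else:
            action = max(range(self.n_actions), key=lambda a: self.q[(state, a)])
        # Decay epsilon toward eps_min so the Q-table can converge.
        self.eps = max(self.eps_min, self.eps * self.eps_decay)
        return action

    def update(self, s, a, r, s2, a2, done=False):
        # On-policy SARSA target: uses the action actually taken in s2.
        nxt = 0.0 if done else self.q[(s2, a2)]
        td_target = r + self.gamma * nxt
        self.q[(s, a)] += self.alpha * (td_target - self.q[(s, a)])
```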
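The two domain-knowledge reward corrections are only named in the abstract; their exact functional forms are not given. The following is a hedged sketch of one plausible reading, assuming each correction is a shaping term added to the base reward; the weights and the RoboCup 2D opponent-goal coordinate (x = 52.5 on the standard 105 x 68 field) are illustrative.

```python
# Sketch of the two reward corrections the abstract names; the functional
# forms and weights below are assumptions, not the thesis's definitions.
import math

def dispersion_bonus(teammates, w=0.1):
    """Term that grows with team dispersion (mean pairwise distance)."""
    n = len(teammates)
    if n < 2:
        return 0.0
    total = sum(math.dist(teammates[i], teammates[j])
                for i in range(n) for j in range(i + 1, n))
    return w * total / (n * (n - 1) / 2)

def transfer_bonus(ball_before, ball_after, opp_goal=(52.5, 0.0), w=0.2):
    """Term rewarding ball transfer toward the opponent goal."""
    return w * (math.dist(ball_before, opp_goal) - math.dist(ball_after, opp_goal))

def corrected_reward(base_r, teammates, ball_before, ball_after):
    return base_r + dispersion_bonus(teammates) + transfer_bonus(ball_before, ball_after)
```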
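Likewise, the multi-Q-table fusion rule is not specified in the abstract. One plausible sketch, assuming each agent's sparse Q-table is merged by visit-count-weighted averaging so that better-explored entries dominate the shared table:

```python
# Sketch of fusing the sparse Q-tables learned independently by several
# agents into one shared table; the weighting rule is an assumption.
from collections import defaultdict

def fuse_q_tables(tables, counts):
    """tables: list of {(s, a): q}; counts: list of {(s, a): visit count}."""
    fused, weight = defaultdict(float), defaultdict(float)
    for q, c in zip(tables, counts):
        for sa, v in q.items():
            n = c.get(sa, 1)      # unvisited entries simply stay absent
            fused[sa] += n * v
            weight[sa] += n
    return {sa: fused[sa] / weight[sa] for sa in fused}
```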
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[CLC number]: TP242
Article ID: 2369012
Link: https://www.wllwen.com/kejilunwen/zidonghuakongzhilunwen/2369012.html