
Research on Memory-Based Reinforcement Learning in Partially Observable Markov Decision Processes

Published: 2018-02-11 07:48

Keywords: reinforcement learning; U-Tree algorithm; Sarsa(λ) algorithm; Q-learning algorithm; partially observable Markov decision process. Source: Tianjin Polytechnic University, master's thesis, 2017. Document type: degree thesis.


[Abstract]: In reinforcement learning, an agent acts on its environment and receives a reward from it, and different actions yield different reward values. By repeatedly reinforcing the rewards earned along the sequence of actions that reaches the goal, the agent learns a mapping from internal states to actions, i.e. a decision policy. The traditional U-Tree algorithm has achieved notable results on reinforcement learning in partially observable Markov decision processes (POMDPs), but because its fringe nodes grow arbitrarily, it still suffers from large tree size, high memory demand, and high computational complexity. This thesis improves the original U-Tree algorithm: using the next-step observation, instances in the same leaf node that take the same action are further partitioned, yielding an Effective Instance U-Tree (EIU-Tree) algorithm that expands fringe nodes from effective instances. This greatly reduces the computational scale and helps the agent learn faster and better. Simulation experiments on the classical 4×3 grid problem show that EIU-Tree outperforms the original U-Tree algorithm. To address the slow convergence of the U-Tree and MU-Tree algorithms, the Sarsa(λ) algorithm is used to update Q-values during the agent's value iteration, yielding a Sarsa(λ) U-Tree (SU-Tree) algorithm. When the agent reaches the goal state or a penalty state, the Q-values of all instances generated along that path are updated, which speeds up convergence. Simulation experiments on the 4×3 grid problem and the cheese-maze problem show that, compared with the original U-Tree and MU-Tree algorithms, the agent finds an oscillation-free path from start to goal more quickly.
[Degree-granting institution]: Tianjin Polytechnic University
[Degree level]: Master's
[Year of degree conferral]: 2017
[Classification (CLC)]: O225; TP18
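
To make the Sarsa(λ) mechanism described in the abstract concrete, the following is a minimal, self-contained sketch of tabular Sarsa(λ) with eligibility traces on a fully observable 4×3 grid. It is not the thesis's SU-Tree implementation: SU-Tree maintains Q-values over U-Tree leaf nodes under partial observability, whereas this sketch indexes Q by raw grid states, uses deterministic transitions, and its layout, reward values, and hyperparameters are illustrative assumptions. What it does show is the property the abstract relies on: each temporal-difference error is propagated to every state-action pair visited on the current path, so reaching the goal or penalty state updates the whole trajectory at once.

```python
# Minimal sketch of tabular Sarsa(lambda) on a 4x3 grid (assumed layout and rewards).
# This is NOT the thesis's SU-Tree code; it only illustrates trace-based Q updates.
import random
from collections import defaultdict

ACTIONS = ['up', 'down', 'left', 'right']
MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
GOAL, PENALTY, WALL = (3, 2), (3, 1), (1, 1)   # classic 4x3 grid coordinates (x, y)

def step(state, action):
    """Deterministic transition for the sketch; the real 4x3 task is stochastic."""
    dx, dy = MOVES[action]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt == WALL or not (0 <= nxt[0] < 4 and 0 <= nxt[1] < 3):
        nxt = state                                # bump into a wall: stay put
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt == PENALTY:
        return nxt, -1.0, True
    return nxt, -0.04, False                       # small per-step cost

def epsilon_greedy(Q, state, epsilon):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_lambda(episodes=500, alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        E = defaultdict(float)                     # eligibility traces, reset per episode
        s = (0, 0)
        a = epsilon_greedy(Q, s, epsilon)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = epsilon_greedy(Q, s2, epsilon)
            # TD error; terminal states bootstrap with 0
            delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
            E[(s, a)] += 1.0                       # accumulating trace
            # One TD error updates every (state, action) visited so far on this path,
            # so reaching the goal or penalty state adjusts the whole trajectory.
            for key in list(E):
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam
            s, a = s2, a2
    return Q

if __name__ == '__main__':
    Q = sarsa_lambda()
    print(max(ACTIONS, key=lambda a: Q[((0, 0), a)]))  # greedy action at the start state
```

In the SU-Tree variant, an analogous trace-based update would be applied to the Q-values of the instances stored in the U-Tree leaves along the current path rather than to raw grid states; that trajectory-wide update is the mechanism the abstract credits for the faster convergence over one-step Q updates.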





