Research on Policy Iteration Algorithms in Bayesian Reinforcement Learning

Published: 2018-07-10 07:50

Topic: Bayesian reinforcement learning + policy iteration; source: Soochow University, 2016 master's thesis

[Abstract]: Bayesian reinforcement learning applies Bayesian techniques to reinforcement learning tasks, using probability distributions to model quantities such as the value function, the policy, and the environment model. Its central idea is to express the uncertainty of unknown parameters through a prior distribution and then to learn by computing the posterior distribution from the observations obtained. Building on this, the thesis takes policy iteration as its framework and proposes three improved reinforcement learning algorithms based on Bayesian inference and policy iteration.

(1) Traditional Bayesian reinforcement learning algorithms cannot dynamically control how many times the environment model is learned when that model is unknown. To address this, a policy iteration algorithm based on Bayesian intelligent model learning is proposed. In the model-learning part, a threshold on the variance of the Dirichlet distribution decides whether the model needs further learning, which keeps model learning sufficient while reducing wasted learning effort. In the policy-learning part, an exploration incentive factor guarantees that exploratory actions are selected and also lets model learning visit every state-action pair, which ensures that the algorithm converges. Model learning and policy learning reinforce each other, so that the algorithm converges to the optimal policy.

(2) Traditional reinforcement learning algorithms cannot efficiently balance the exploration and exploitation of actions. To address this, an asynchronous policy iteration algorithm based on probabilistic estimation of the action-value function (Q-value function) is proposed. In the policy-evaluation part, the Q-value function is modeled with a Gaussian-gamma distribution, and its posterior is computed from the prior distribution and the observed data in order to evaluate the policy. In the policy-improvement part, Myopic-VPI is applied to the posterior distribution of the Q-value function to select the optimal action, which keeps exploration and exploitation in balance. Finally, the algorithm uses asynchronous updates that favor computing the action values relevant to the policy, which speeds up convergence.

(3) Traditional policy iteration algorithms cannot efficiently solve MDP problems with continuous states and an unknown environment model. To address this, an online policy iteration algorithm based on Gaussian process temporal difference learning is proposed. It models the action-value function with a Gaussian process and the temporal-difference formula and, combined with Bayesian inference, computes the posterior distribution over the value-function space. During learning, it exploits the nature of online learning to evaluate each improved policy promptly, improving the policy as it learns. To some extent, the proposed algorithm can complete reinforcement learning tasks in continuous state spaces and converges comparatively quickly.
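
The abstract does not give the exact stopping rule used in contribution (1), so the following is only a minimal sketch of the general idea of a Dirichlet variance threshold, not the thesis's algorithm. It assumes a discrete MDP in which the next-state distribution of each state-action pair has a Dirichlet posterior built from visit counts. The function names, the symmetric prior of 1.0, and the threshold of 1e-3 are all illustrative assumptions.

```python
import numpy as np

def dirichlet_variance(alpha):
    """Per-component variance of a Dirichlet(alpha) distribution.

    Var[p_i] = alpha_i * (alpha_0 - alpha_i) / (alpha_0**2 * (alpha_0 + 1)),
    where alpha_0 is the sum of all concentration parameters.
    """
    alpha = np.asarray(alpha, dtype=float)
    alpha0 = alpha.sum()
    return alpha * (alpha0 - alpha) / (alpha0 ** 2 * (alpha0 + 1.0))

def needs_more_samples(counts, prior=1.0, threshold=1e-3):
    """Hypothetical stopping test for one state-action pair.

    counts[j] is how often next state j was observed. The posterior is
    Dirichlet(counts + prior); model learning continues while the largest
    component variance is still above the (assumed) threshold.
    """
    alpha = np.asarray(counts, dtype=float) + prior
    return bool(dirichlet_variance(alpha).max() > threshold)

# An unvisited pair is still uncertain; a heavily sampled one is not.
print(needs_more_samples([0, 0, 0]))        # True
print(needs_more_samples([500, 300, 200]))  # False
```

Because the variance shrinks roughly in proportion to one over the number of visits, a rule of this kind stops sampling a well-explored state-action pair while continuing to sample rarely visited ones.
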
[Degree-granting institution]: Soochow University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP181

[Related literature]