A Study of Policy Gradient Methods Based on Functional Gradients

Published: 2019-04-29 17:02

[Abstract]: Reinforcement learning is an important research direction in machine learning. Its goal is for an agent to improve its policy continually through interaction with an environment so as to maximize the cumulative reward it receives. Classical reinforcement learning methods are mostly based on value functions, but value-function-based methods have difficulty handling tasks with continuous actions and suffer from the "policy degradation" phenomenon. Policy search methods have therefore developed considerably in recent years. Policy gradient methods are an important class of policy search methods that update the policy using the gradient with respect to the policy parameters. In policy gradient methods the policy is often represented by a linear model, so the system is constrained by the limited representational power of linear models. Functional gradients, by contrast, can be used in supervised learning to produce non-parametric models, and Boosting methods based on functional gradients have become one of the representative approaches in supervised learning. Functional gradients have nevertheless received little study in reinforcement learning. This thesis studies the use of functional gradients in policy gradient methods and makes the following main contributions. First, it designs PolicyBoost, a policy gradient method based on functional gradients that can learn combinations of complex models such as decision trees, which avoids the earlier need to hand-design linear features. Second, it proves that PolicyBoost converges under certain conditions. To address the overfitting that the theoretical analysis shows may occur, it introduces a baseline and builds a sample pool, which alleviates the overfitting problem. Finally, experiments on the classic reinforcement learning tasks Mountain Car and Acrobot, and on the challenging helicopter hovering control task, verify that the proposed algorithm performs well and is stable.
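The general idea of a policy gradient taken in function space can be illustrated with a small sketch. This is not the thesis's PolicyBoost algorithm, whose details the abstract does not give. It is a toy written under stated assumptions: a one-step task with a one-dimensional state, a Gaussian policy whose mean is a sum of regression stumps (standing in for decision trees), and the batch-average reward used as the baseline. The sample pool mentioned in the abstract is not modeled. All names (`fit_stump`, `policy_mean`, `reward`, `train`) and all parameter values are hypothetical.

```python
import random


def fit_stump(xs, ys, min_leaf=5):
    """Least-squares regression stump on 1-D inputs.

    Returns (threshold, left_value, right_value)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    # Fallback when no valid split exists: a constant prediction.
    best_score, best = float("-inf"), (0.0, total / n, total / n)
    left_sum = 0.0
    for k in range(1, n):
        left_sum += pairs[k - 1][1]
        if k < min_leaf or n - k < min_leaf:
            continue
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing squared error.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if score > best_score:
            best_score = score
            threshold = 0.5 * (pairs[k - 1][0] + pairs[k][0])
            best = (threshold, left_sum / k, right_sum / (n - k))
    return best


def policy_mean(stumps, s):
    """Mean action at state s: a weighted sum of all stumps fitted so far."""
    return sum(step * (lv if s <= thr else rv) for step, thr, lv, rv in stumps)


def reward(s, a):
    """Toy one-step reward (an assumption): best action is +1 for s > 0, else -1."""
    target = 1.0 if s > 0 else -1.0
    return -(a - target) ** 2


def train(rounds=150, batch=100, sigma=0.5, step=0.05, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(rounds):
        states = [rng.uniform(-1.0, 1.0) for _ in range(batch)]
        means = [policy_mean(stumps, s) for s in states]
        actions = [rng.gauss(m, sigma) for m in means]
        rewards = [reward(s, a) for s, a in zip(states, actions)]
        baseline = sum(rewards) / batch  # variance-reducing baseline
        # Pointwise functional gradient of expected reward w.r.t. the mean
        # function: d log N(a | f(s), sigma^2) / d f(s) * (r - baseline).
        grads = [(a - m) / sigma ** 2 * (r - baseline)
                 for a, m, r in zip(actions, means, rewards)]
        # Project the pointwise gradient onto a weak learner and take a step.
        thr, lv, rv = fit_stump(states, grads)
        stumps.append((step, thr, lv, rv))
    return stumps
```

Calling `train()` returns the list of fitted stumps, and `policy_mean(stumps, s)` evaluates the learned mean action at a state. The point of the sketch is that no features are designed by hand: each round fits a new weak learner to the sampled gradient and adds it to the policy.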
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP181
