基于凸多面体抽象域的自适应强化学习技术研究

发布时间：2018-12-27 14:57

【摘要】：表格驱动的算法是解决强化学习问题的一类重要方法,但由于"维数灾"现象的存在,这种方法不能直接应用于解决具有连续状态空间的强化学习问题.解决维数灾问题的方法主要包括两种:状态空间的离散化和函数近似方法.相比函数近似,基于连续状态空间离散化的表格驱动方法具有原理直观、程序结构简单和计算轻量化的特点.基于连续状态空间离散化方法的关键是发现合适的状态空间离散化机制,平衡计算量及准确性,并且确保基于离散抽象状态空间的数值性度量,例如V值函数和Q值函数,可以较为准确地对原始强化学习问题进行策略评估和最优策略π*计算.文中提出一种基于凸多面体抽象域的自适应状态空间离散化方法,实现自适应的基于凸多面体抽象域的Q(λ)强化学习算法(Adaptive Polyhedra Domain based Q(λ),APDQ(λ)).凸多面体是一种抽象状态的表达方法,广泛应用于各种随机系统性能评估和程序数值性属性的验证.这种方法通过抽象函数,建立具体状态空间至多面体域的抽象状态空间的映射,把连续状态空间最优策略的计算问题转化为有限大小的和易于处理的抽象状态空间最优策略的计算问题.根据与抽象状态相关的样本集信息,设计了包括BoxRefinement、LFRefinement和MVLFRefinement多种自适应精化机制.依据这些精化机制,对抽象状态空间持续进行适应性精化,从而优化具体状态空间的离散化机制,产生符合在线抽样样本空间所蕴涵的统计奖赏模型.基于多面体专业计算库PPL(Parma Polyhedra Library)和高精度数值计算库GMP(GNU Multiple Precision)实现了算法APDQ(λ),并实施了实例研究.选择典型的连续状态空间强化学习问题山地车(Mountain Car,MC)和杂技机器人(Acrobatic robot,Acrobot)作为实验对象,详细评估了各种强化学习参数和自适应精化相关的阈值参数对APDQ(λ)性能的影响,探究了抽象状态空间动态变化情况下各种参数在策略优化过程中的作用机理.实验结果显示当折扣率γ大于0.7时,算法展现出较好的综合性能,在初期,策略都快速地改进,后面的阶段平缓地趋向收敛(如图6~图13所示),并且对学习率α和各种抽象状态空间精化参数都具有较好的适应性;当折扣率γ小于0.6时,算法的性能衰退较快.抽象解释技术用于统计学习过程是一种较好的解决连续强化学习问题的思想,有许多问题值得进一步研究和探讨,例如基于近似模型的采样和值函数更新等问题.
[Abstract]:Table-driven algorithm is an important method to solve reinforcement learning problem. However, due to the existence of "dimensionality disaster", this method can not be directly applied to solve reinforcement learning problem with continuous state space. There are two methods to solve the problem of dimensionality disaster: discretization of state space and approximation of function. Compared with the function approximation, the table-driven method based on continuous state space discretization has the advantages of intuitive principle, simple program structure and lightweight calculation. The key of discretization method based on continuous state space is to find appropriate discretization mechanism of state space, balance computation and accuracy, and ensure numerical measures based on discrete abstract state space, such as V value function and Q value function. It is possible to evaluate the original reinforcement learning problem and calculate the optimal strategy 蟺 * accurately. In this paper, an adaptive state space discretization method based on convex polyhedron abstract domain is proposed. The adaptive Q (位) reinforcement learning algorithm (Adaptive Polyhedra Domain based Q (位), APDQ (位). Based on convex polyhedron abstract domain is implemented. Convex polyhedron is an abstract state representation method, which is widely used to evaluate the performance of random systems and verify the numerical properties of programs. The mapping of concrete state space to the abstract state space of polyhedron domain is established by abstract function. The computation problem of continuous state space optimal strategy is transformed into a finite size and easy to deal with the computation problem of abstract state space optimal policy. According to the sample set information related to abstract state, several adaptive refinement mechanisms including BoxRefinement,LFRefinement and MVLFRefinement are designed. According to these refinement mechanisms, the abstract state space is continuously refined adaptively to optimize the discretization mechanism of the specific state space, and to produce a statistical reward model consistent with the sample space of online sampling. The algorithm APDQ (位) is realized based on the polyhedron professional computing library PPL (Parma Polyhedra Library) and the high precision numerical calculation library GMP (GNU Multiple Precision), and a case study is carried out. The typical continuous state space reinforcement learning problem (Mountain Car,MC) and acrobatics robot (Acrobatic robot,Acrobot) were selected as experimental objects. The effects of various reinforcement learning parameters and threshold parameters related to adaptive refinement on the performance of APDQ (位) are evaluated in detail, and the mechanism of various parameters in the process of policy optimization under the dynamic change of abstract state space is explored. The experimental results show that when the discount rate 纬 is greater than 0.7, the algorithm shows good comprehensive performance. In the initial stage, the strategy is improved quickly, and the later stage converges gently (as shown in figs. 6 ~ 13). And it has good adaptability to learning rate 伪 and various abstract state space refinement parameters. When the discount rate 纬 is less than 0.6, the performance of the algorithm declines rapidly. Abstract interpretation technology used in statistical learning process is a good idea to solve the continuous reinforcement learning problem. There are many problems worthy of further study and discussion, such as sampling based on approximate model and value function updating and so on.
【作者单位】：苏州大学计算机科学与技术学院;符号计算与知识工程教育部重点实验室(吉林大学);
【基金】：国家自然科学基金项目(61272005,61303108,61373094,61472262,61502323,61502329) 江苏省自然科学基金项目(BK2012616) 江苏省高校自然科学研究项目(13KJB520020) 吉林大学符号计算与知识工程教育部重点实验室项目(93K172014K04) 苏州市应用基础研究计划项目(SYG201422) 苏州大学高校省级重点实验室基金项目(KJS1524) 中国国家留学基金项目(201606920013) 浙江省自然科学基金(LY16F010019)资助~~
【分类号】：TP181

【相似文献】