Recursive Least-Squares Temporal Difference Reinforcement Learning Algorithm Based on Improved ELM and Its Application

Published: 2018-01-16 03:04

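Keywords: recursive least-squares temporal difference reinforcement learning algorithm based on improved ELM and its application. Source: CIESC Journal (《化工学报》), 2017, Issue 03. Paper type: journal article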

[Abstract]: To meet the accuracy and computation-time requirements of value function approximation, a recursive least-squares temporal difference reinforcement learning algorithm based on an improved extreme learning machine (ELM) is proposed. First, a recursive scheme is introduced into the least-squares temporal difference (LSTD) reinforcement learning algorithm to eliminate the matrix inversion in the least-squares step, yielding a recursive LSTD algorithm with lower complexity and computational load. Second, because the LSTD(0) algorithm converges slowly, eligibility traces are added to increase sample utilization and speed up convergence, yielding an LSTD(λ) algorithm that is intended to converge to the true value after experiencing the same number of trajectories. In addition, the value functions of most reinforcement learning problems are monotonic, whereas traditional ELM methods usually use the Sigmoid activation function with its two-sided suppression property, which increases the computational cost. The Softplus activation function, which has a one-sided suppression property, is therefore adopted in place of the traditional Sigmoid function to reduce computation and increase speed, so that the algorithm improves accuracy and computation speed together. The proposed algorithm is compared with the traditional radial basis function (RBF) based least-squares reinforcement learning algorithm and the ELM-based least-squares TD algorithm on the generalized Hop-world problem. The results show that the proposed algorithm effectively increases computation speed while meeting the accuracy requirement, and under some conditions achieves higher accuracy than the other two algorithms.
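The abstract does not give the update equations, so the following is only a minimal sketch of the ideas it names, not the authors' implementation. It assumes the standard recursive LSTD(λ) form, in which the Sherman-Morrison identity updates an estimate of the inverse matrix directly so that no explicit inversion is needed, combined with a fixed random hidden layer in the ELM style that uses Softplus. The names `ELMFeatures`, `RLSTDLambda`, and `run_toy_chain`, the initialization `P = I / delta`, all hyperparameter values, and the one-dimensional random walk (a toy stand-in, not the Hop-world benchmark) are illustrative assumptions.

```python
import numpy as np


def softplus(x):
    """Softplus activation log(1 + exp(x)), computed in a stable way."""
    return np.logaddexp(0.0, x)


class ELMFeatures:
    """Fixed random hidden layer in the ELM style: phi(s) = softplus(W s + b).

    Input weights are drawn once and never trained; the sampling range
    [-1, 1] is an illustrative assumption.
    """

    def __init__(self, state_dim, n_hidden, rng):
        self.W = rng.uniform(-1.0, 1.0, size=(n_hidden, state_dim))
        self.b = rng.uniform(-1.0, 1.0, size=n_hidden)

    def __call__(self, state):
        s = np.atleast_1d(np.asarray(state, dtype=float))
        return softplus(self.W @ s + self.b)


class RLSTDLambda:
    """Recursive least-squares TD(lambda) for a linear value estimate phi @ theta."""

    def __init__(self, n_features, gamma=0.95, lam=0.5, delta=1.0):
        self.gamma, self.lam = gamma, lam
        self.theta = np.zeros(n_features)      # output weights
        self.P = np.eye(n_features) / delta    # estimate of A^{-1}, with A0 = delta * I
        self.z = np.zeros(n_features)          # eligibility trace

    def update(self, phi, reward, phi_next, terminal=False):
        # Accumulate the eligibility trace for the current state.
        self.z = self.gamma * self.lam * self.z + phi
        # Temporal-difference feature vector d = phi - gamma * phi_next.
        d = phi if terminal else phi - self.gamma * phi_next
        # Sherman-Morrison update of (A + z d^T)^{-1}; no matrix inversion.
        Pz = self.P @ self.z
        gain = Pz / (1.0 + d @ Pz)
        self.theta = self.theta + gain * (reward - d @ self.theta)
        self.P = self.P - np.outer(gain, d @ self.P)
        if terminal:
            self.z = np.zeros_like(self.z)


def run_toy_chain(n_steps=500, seed=0):
    """Evaluate a random walk on [0, 1] whose reward is the next state (toy task)."""
    rng = np.random.default_rng(seed)
    features = ELMFeatures(state_dim=1, n_hidden=8, rng=rng)
    learner = RLSTDLambda(n_features=8)
    s = 0.5
    for _ in range(n_steps):
        s_next = float(np.clip(s + rng.normal(0.0, 0.1), 0.0, 1.0))
        learner.update(features(s), reward=s_next, phi_next=features(s_next))
        s = s_next
    return features, learner


features, learner = run_toy_chain()
print("estimated value at s = 0.5:", features(0.5) @ learner.theta)
```

Each update costs on the order of the square of the number of hidden units, compared with the cubic cost of re-solving the least-squares system after every sample. Softplus needs one logarithm and one exponential per unit, so any speed advantage over Sigmoid depends on the implementation and is not demonstrated by this sketch.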

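[Affiliation]: College of Information Science and Technology, Beijing University of Chemical Technology
[Funding]: National Natural Science Foundation of China (61573051, 61472021); Open Project of the State Key Laboratory of Software Development Environment (SKLSDE-2015KF-01); Fundamental Research Funds for the Central Universities (PT1613-05)
[Classification No.]: TP181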
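[Text snapshot]: Introduction. Reinforcement learning, proposed by Watkins et al. [1-3], is a new class of machine learning algorithms rooted in psychology. Its main idea is that an agent interacts with its environment through trial and error and uses the environment's feedback signal as input to optimize its policy. Policy optimization requires correct policy evaluation and policy iteration techniques, and how to estimate the value function correctly is a central problem in policy evaluation. Reinforcement learning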

[Similar literature]