Research on Path Planning Based on Reinforcement Learning
Published: 2018-06-24 04:12
Topic keywords: unknown environment + reinforcement learning; Source: Master's thesis, Harbin Institute of Technology, 2017
[Abstract]: Research on human-machine symbiosis under uncertainty centers on machine-learning-based environment and situation awareness, on action and path planning and decision making, and on the evaluation of decision outcomes. It raises both scientific-theoretical questions and many engineering problems, and studying them has clear theoretical significance and practical value. This thesis studies reinforcement learning solutions for agent path planning in unknown environments.

Path planning for a robot or agent in a given environment means finding a path from a specified start point to a goal that does not collide with obstacles. The problem has been studied for a long time and many mature algorithms exist, but most of them assume a known environment model combined with search. In practice, however, an environment model is often hard to obtain. Moreover, control errors and environmental disturbances can make the executed motion deviate from the issued commands, so the robot cannot follow the planned path and may fail to reach the goal at all. Finally, a planned path may be tortuous and full of turning points, which is unfavorable for actual robot motion. To address these problems, this thesis applies the temporal-difference (TD) method from reinforcement learning to path planning and proposes an improved solution to the exploration-exploitation balance problem in reinforcement learning. The main contributions are as follows:

(1) Solving path planning with the temporal-difference method. Compared with other algorithms, it needs no model of the environment and has a degree of adaptivity and self-learning ability, so it can handle uncertainty in the agent's motion. Simulation experiments verify the algorithm: the results show that the TD method converges quickly and can find a path to the goal from any position.

(2) Improving the exploration-exploitation balance that arises in practical applications of reinforcement learning. Exploration and exploitation are two ever-present processes: too much exploration lengthens training, while too much exploitation can make the agent converge to an incorrect solution, so balancing the two is an important research direction. Traditional methods simply reduce exploration as training time increases, without considering the complexity of the environment or of the problem itself. Building on the path-planning setting, this thesis uses the agent's success rate of reaching the goal as a measure of how well the agent has mastered the environment and adjusts the exploration factor dynamically: the agent explores the environment more when its mastery is low, and gradually explores less and exploits more as its mastery grows. Simulation experiments show that the improved exploration method balances exploration and exploitation better and lets the agent reach the goal faster.
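To make contribution (1) concrete, below is a minimal sketch of tabular Q-learning, a standard TD control method, on a small grid world with obstacles. The abstract does not specify which TD variant, state representation, or rewards the thesis uses, so the grid layout, reward values, and hyperparameters here are illustrative assumptions only.

```python
import random

# Hypothetical grid world: S = start, G = goal, # = obstacle, . = free cell.
GRID = [
    "S..#.",
    ".#.#.",
    ".#...",
    ".#.#.",
    "...#G",
]
ROWS, COLS = len(GRID), len(GRID[0])
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Apply an action; moves into walls or obstacles leave the state unchanged."""
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or GRID[nr][nc] == "#":
        return state, -1.0, False      # penalize bumping into obstacles
    if GRID[nr][nc] == "G":
        return (nr, nc), 10.0, True    # goal reached, episode ends
    return (nr, nc), -0.1, False       # small step cost favors short paths

def q_learning(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1, max_steps=200):
    """Tabular Q-learning: Q(s, a) is updated toward the one-step TD target."""
    Q = {(r, c): [0.0] * len(ACTIONS)
         for r in range(ROWS) for c in range(COLS) if GRID[r][c] != "#"}
    for _ in range(episodes):
        state = (0, 0)                 # the 'S' cell
        for _ in range(max_steps):
            # epsilon-greedy: explore with probability epsilon, else act greedily
            if random.random() < epsilon:
                a = random.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: Q[state][i])
            nxt, reward, done = step(state, ACTIONS[a])
            # TD(0) update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * max(Q[nxt]))
            Q[state][a] += alpha * (target - Q[state][a])
            state = nxt
            if done:
                break
    return Q
```

After training, a greedy walk over Q (always taking the argmax action) traces a collision-free path from any reachable cell, which is the sense in which the learned value function yields a path to the goal from an arbitrary start position.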
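Contribution (2) ties the exploration factor to the agent's success rate at reaching the goal instead of decaying it on a fixed timetable. The sketch below shows one way such a schedule could look; the sliding-window size, the epsilon bounds, and the linear mapping from success rate to epsilon are illustrative assumptions, not the thesis's exact rule.

```python
from collections import deque

class SuccessRateEpsilon:
    """Exploration factor driven by the agent's recent goal-reaching success
    rate, as a proxy for how well it has mastered the environment. The window
    size and the linear mapping are illustrative assumptions."""

    def __init__(self, window=50, eps_min=0.01, eps_max=0.5):
        self.outcomes = deque(maxlen=window)  # 1 = reached goal, 0 = failed
        self.eps_min, self.eps_max = eps_min, eps_max

    def record(self, reached_goal: bool):
        """Call once per episode with whether the agent reached the goal."""
        self.outcomes.append(1 if reached_goal else 0)

    @property
    def epsilon(self):
        # With no data yet, mastery is unknown: explore at the maximum rate.
        if not self.outcomes:
            return self.eps_max
        success_rate = sum(self.outcomes) / len(self.outcomes)
        # High success rate (high mastery) -> epsilon near eps_min;
        # low success rate (low mastery) -> epsilon near eps_max.
        return self.eps_max - (self.eps_max - self.eps_min) * success_rate
```

Plugged into the loop above, `schedule.epsilon` would replace the fixed `epsilon` argument and `schedule.record(done)` would be called at the end of each episode, so exploration shrinks only as measured mastery of the environment actually grows rather than on a clock.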
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year of degree conferral]: 2017
[CLC classification]: TP18; TP242