Research on Agent-Based Learning Methods for Multi-Robot Systems

Published: 2018-08-26 11:03
[Abstract]: Compared with a single robot, a multi-robot system (MRS) has many advantages and good prospects for development, and has become a research hotspot in robotics. A multi-robot system is a complex dynamic system: when designing robot control strategies, it is usually impossible to specify all the optimal behaviors for every robot in advance. Behavior-based methods allow a multi-robot system to exhibit intelligent characteristics and to accomplish relatively complex tasks, which has greatly promoted the development of multi-robot systems. However, behavior-based methods alone cannot fully adapt to a changing external environment and the demands of different tasks. Giving the multi-robot system an autonomous learning ability, while avoiding the limitations of any single learning method, and thereby continually improving the coordination and cooperation among individual robots, is an important development direction for multi-robot systems. It is therefore of real significance to combine different machine learning methods with behavior-based multi-robot systems. This thesis studies multi-robot systems from the perspective of agent theory; its main contributions are as follows.

First, the theory of agents and multi-agent systems is reviewed, several architectures for single-robot and multi-robot systems are analyzed, and a research approach that combines behavior-based and learning-based methods to explore multi-robot cooperation is proposed; behavior-based robot formation and robot soccer systems are designed. Among the many research topics in multi-robot systems, learning ability occupies an important position. Behavior-based methods are robust and flexible and, compared with other methods, allow robots to accomplish their tasks more effectively. Taking the behavior-based approach as the foundation and combining it with different machine learning methods, behavior-based multi-robot systems are designed for the two main application platforms, robot formation and robot soccer, on top of the robot simulation software Mission Lab and Teambots, so that the algorithms proposed in this thesis can be verified.

Second, the particle swarm optimization (PSO) algorithm and the case-based reasoning (CBR) method are studied, and a hybrid approach that integrates PSO with CBR is proposed to exploit the respective strengths of the two methods. Traditional behavior-based methods have many advantages, but their fixed behavior parameters make it hard to adapt to a complex external environment. CBR, an important technique in artificial intelligence, supports easy retrieval and storage, which makes it well suited to providing the parameters for different behaviors; traditional CBR, however, lacks an effective learning ability. This thesis therefore uses PSO as an optimizer for CBR, so that CBR continually obtains better cases, while PSO in turn obtains a better initial population from CBR. Like the genetic algorithm (GA), PSO is a swarm intelligence method, but it has a simpler structure, better real-time performance, and is well suited to optimizing continuous problems; broadly speaking, the problems a genetic algorithm can solve can also be solved by PSO. Combining PSO with CBR not only overcomes the shortcomings of CBR but also satisfies the requirements of real-time operation and continuous optimization. Using the behavior-based robot formation as the test platform, the effectiveness of the method is verified by comparison with the standard PSO algorithm.
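To make the coupling concrete, the following Python sketch (my illustration, not the thesis's implementation) shows one way PSO can refine behavior parameters retrieved from a CBR case base and write the improved parameters back as a new case; the function names (fitness, retrieve, pso_refine, adapt), the parameter dimension, and all constants are hypothetical assumptions.

```python
import random

# Illustrative sketch of coupling PSO with a CBR case base.
# A "case" maps an observed situation (feature vector) to behavior
# parameters (e.g. gains of formation behaviors); fitness() is a
# hypothetical stand-in for evaluating parameters in simulation.

def fitness(params):
    # Placeholder objective; in a real system this would come from running
    # the behavior-based formation in simulation.
    return -sum((p - 0.5) ** 2 for p in params)

def retrieve(case_base, situation, k=5):
    # CBR retrieval: the k stored cases nearest to the current situation.
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(c["situation"], situation))
    return sorted(case_base, key=dist)[:k]

def pso_refine(seeds, dim, iters=50, w=0.7, c1=1.5, c2=1.5):
    # Standard PSO whose initial swarm is seeded from retrieved CBR cases.
    swarm = [list(s) for s in seeds] or [[random.random() for _ in range(dim)]]
    vel = [[0.0] * dim for _ in swarm]
    pbest = [list(p) for p in swarm]
    gbest = max(pbest, key=fitness)
    for _ in range(iters):
        for i, p in enumerate(swarm):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - p[d])
                             + c2 * r2 * (gbest[d] - p[d]))
                p[d] += vel[i][d]
            if fitness(p) > fitness(pbest[i]):
                pbest[i] = list(p)
        gbest = max(pbest + [gbest], key=fitness)
    return gbest

def adapt(case_base, situation, dim=4):
    # Retrieve similar cases, refine their parameters with PSO, and store
    # the optimized result as a new case (the CBR "learning" step).
    seeds = [c["params"] for c in retrieve(case_base, situation)]
    best = pso_refine(seeds, dim)
    case_base.append({"situation": situation, "params": best})
    return best
```

In this sketch the case base plays the role the abstract describes: it seeds PSO with a good initial population, and PSO keeps adding better-performing parameter sets back into the case base.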
Next, the basic theory of reinforcement learning and the typical Q-learning method are studied. To address the drawbacks of applying traditional Q-learning to multi-robot systems, namely the lack of information exchange and the structural credit assignment problem, an improved Q-learning algorithm using experience sharing and filtering techniques is proposed, which improves learning performance and efficiency. The theoretical basis of Q-learning is the Markov decision process; applying Q-learning directly to a multi-robot system violates this premise, yet Q-learning is still widely used in robot learning because it is computationally simple and its state-action space remains small. Compared with multi-agent reinforcement learning methods, the traditional Q-learning algorithm lacks information exchange with other agents. This thesis therefore adopts experience sharing, in which each agent shares the Q-value information of the other agents; learning proceeds gradually, with an ε-greedy strategy selecting the learning experience of other agents with probability 1-ε. In addition, to accelerate the convergence of Q-learning, rather than simply distributing the reward signal uniformly to every agent, Kalman filtering is applied to the distribution of the reward signal: the received reward is treated as a combination of the true reward and noise, which alleviates the structural credit assignment problem to a certain extent. Using robot soccer as the test platform, the effectiveness of the method is verified by comparison with the traditional Q-learning algorithm.
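A minimal sketch of the two ideas just described, assuming tabular Q-learning and a scalar Kalman filter on the shared team reward; the Agent class, its method names, and the noise constants are illustrative assumptions rather than the thesis's actual code.

```python
import random
from collections import defaultdict

class Agent:
    """Tabular Q-learner with Q-value sharing and a Kalman-filtered reward."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, eps=0.2,
                 process_noise=1e-3, measurement_noise=0.5):
        self.Q = defaultdict(float)                 # Q[(state, action)]
        self.actions = actions
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        # Kalman filter state for the agent's estimate of its true reward.
        self.r_est, self.p = 0.0, 1.0
        self.process_noise, self.measurement_noise = process_noise, measurement_noise

    def act(self, state, teammates):
        if random.random() < self.eps:
            return random.choice(self.actions)      # exploration
        # Experience sharing: with probability 1 - eps, also consult the
        # Q-values shared by the other agents when ranking actions.
        use_shared = bool(teammates) and random.random() < 1 - self.eps
        def value(a):
            q = self.Q[(state, a)]
            if use_shared:
                q = max(q, max(t.Q[(state, a)] for t in teammates))
            return q
        return max(self.actions, key=value)

    def filtered_reward(self, observed_r):
        # Treat the received reward as true reward + noise and keep a
        # running Kalman estimate used for credit assignment.
        self.p += self.process_noise                    # predict
        k = self.p / (self.p + self.measurement_noise)  # Kalman gain
        self.r_est += k * (observed_r - self.r_est)     # correct
        self.p *= 1 - k
        return self.r_est

    def update(self, state, action, observed_r, next_state):
        r = self.filtered_reward(observed_r)
        best_next = max(self.Q[(next_state, a)] for a in self.actions)
        td_error = r + self.gamma * best_next - self.Q[(state, action)]
        self.Q[(state, action)] += self.alpha * td_error
```

Each robot would run one such agent; after every step it calls update() with the team reward, so that the filtered estimate, rather than the raw noisy team signal, drives its own Q-update.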
Finally, several typical multi-agent reinforcement learning algorithms, Minimax-Q, Nash-Q, FFQ, and CE-Q, together with learning methods based on regret theory, are studied. To address the slow convergence of the traditional CE-Q algorithm, caused by the lack of an effective action exploration strategy, a new CE-Q learning algorithm that adopts a no-regret strategy is proposed. Markov game theory provides a sound theoretical foundation for multi-agent reinforcement learning, and the Nash equilibrium plays an important role in it, so these algorithms are also called equilibrium-based learning algorithms. Compared with computing the Nash equilibrium in the Nash-Q learning algorithm, computing the correlated equilibrium in CE-Q is easier, so CE-Q has better application prospects; however, the traditional CE-Q method lacks an effective action exploration strategy, which limits its convergence speed. Inspired by the theory of no-regret strategies, if every agent chooses an action exploration strategy that reduces its average regret, the behavior of all agents tends to converge to a set of no-regret points, known as the set of coarse correlated equilibria; analysis further shows that both Nash equilibria and correlated equilibria are in essence coarse correlated equilibria. A new CE-Q learning algorithm that reduces the average regret is therefore proposed to speed up the convergence of CE-Q learning. Using robot soccer as the test platform, the effectiveness of the method is verified by comparison with the traditional CE-Q learning algorithm.
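The following sketch shows a regret-matching rule of the no-regret type referred to above, assuming for simplicity that the payoff of every action (including unplayed ones) can be evaluated after each stage game; the RegretMatcher class and its interface are hypothetical.

```python
import random

class RegretMatcher:
    """No-regret action selection via regret matching (illustrative)."""

    def __init__(self, actions):
        self.actions = actions
        self.regret = {a: 0.0 for a in actions}     # cumulative regrets

    def policy(self):
        # Play each action with probability proportional to its positive
        # cumulative regret; fall back to uniform if no regret is positive.
        pos = {a: max(r, 0.0) for a, r in self.regret.items()}
        total = sum(pos.values())
        if total <= 0.0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: p / total for a, p in pos.items()}

    def sample(self):
        r, acc = random.random(), 0.0
        for a, p in self.policy().items():
            acc += p
            if r <= acc:
                return a
        return self.actions[-1]

    def update(self, played, payoffs):
        # payoffs[a]: payoff the agent would have received by playing a.
        for a in self.actions:
            self.regret[a] += payoffs[a] - payoffs[played]
```

When every agent explores with such a rule, the empirical distribution of joint play approaches the set of coarse correlated equilibria, which is the property the abstract relies on to accelerate CE-Q's convergence.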
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year of degree award]: 2016
[CLC number]: TP242






