Design and Implementation of a High-Availability Scheme for OpenStack Clusters

Published: 2018-04-18 18:07

Topic: OpenStack + high availability; Reference: Harbin Institute of Technology, master's thesis, 2017

[Abstract]: With the continuing development of cloud computing, users can consume computing resources much as they consume utilities such as water and electricity. OpenStack, an open-source cloud platform management system, emerged to make it convenient to manage the large pools of compute, network, and storage resources in a cloud. In fields such as finance and government, servers carry the computation and storage of large volumes of important data, and a server failure can have disastrous consequences and cause huge losses. When hardware components fail, the system crashes, power is lost unexpectedly, or the network misbehaves, the system therefore needs to keep downtime as short as possible, recover automatically, and preserve availability to the greatest extent possible. OpenStack itself, however, does not provide high availability, so anyone taking advantage of its convenience must supply that capability. This thesis analyzes the structure of high-availability clusters and builds on the common corosync+pacemaker scheme. To address the long corosync convergence time caused by a large number of cluster nodes, it proposes dividing the cluster into detection domains; to reduce the probability of misjudging a failure, it adds a dual-link heartbeat detection scheme based on the management network and the storage network.
Because the resource agents bundled with pacemaker perform poorly when there are many nodes, a custom resource agent was developed. It reports physical host failures; reports virtual machine failures and shutdowns; sends warnings when the management network or the storage network fails; uses corosync to detect changes in the set of connected nodes within a detection domain; when the current node fails to report, passes messages via the corosync token to search laterally for an available node and submit the migration request; uses shared storage to determine the heartbeats of other isolated hosts or split groups, maintain the list of split groups, and compete for the domain lock; and responds to virtual machine failures notified by pacemaker. Virtual machine migration requires choosing a destination host, so a dynamic resource scheduling service was implemented for this purpose. Functional and availability testing shows that the system can start and shut down virtual machines and that, after a physical host or virtual machine failure, it can migrate the virtual machine, which then continues to run the workload of the original virtual machine. Virtual machine migration times were all around twenty seconds, meeting the high-availability standard.
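
The dual-link heartbeat idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function name, the state names, and the 5-second timeout are assumptions made for illustration. It only shows the decision rule the abstract implies, namely that losing one link produces a warning while a host is treated as failed only when both the management-network and storage-network heartbeats are lost.

```python
from enum import Enum


class HostState(Enum):
    """Hypothetical states for one monitored host."""
    HEALTHY = "healthy"
    MGMT_NETWORK_WARNING = "management network warning"
    STORAGE_NETWORK_WARNING = "storage network warning"
    HOST_FAILED = "host failed"


def classify_host(mgmt_age_s: float, storage_age_s: float,
                  timeout_s: float = 5.0) -> HostState:
    """Classify a host from the age (in seconds) of its last heartbeat
    on each link. The timeout value is an assumed placeholder."""
    mgmt_alive = mgmt_age_s <= timeout_s
    storage_alive = storage_age_s <= timeout_s

    if mgmt_alive and storage_alive:
        return HostState.HEALTHY
    if storage_alive:
        # Only the management link is silent: warn, do not migrate.
        return HostState.MGMT_NETWORK_WARNING
    if mgmt_alive:
        # Only the storage link is silent: warn, do not migrate.
        return HostState.STORAGE_NETWORK_WARNING
    # Both links are silent: treat the host as failed.
    return HostState.HOST_FAILED


if __name__ == "__main__":
    print(classify_host(1.0, 1.0).value)
    print(classify_host(30.0, 1.0).value)
    print(classify_host(30.0, 30.0).value)
```

Requiring both links to time out before declaring a failure is what lowers the chance of a false positive: a single faulty network path alone cannot trigger a migration.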

[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.09

[References]