Research and Implementation of Cluster Fault Monitoring Based on a Cloud Platform

Published: 2018-06-30 05:21

Topic: cloud platform + monitoring system; Source: Beijing University of Posts and Telecommunications, 2014 master's thesis

[Abstract]: As Internet technology spreads and information technology continues to advance, every sector of society places higher demands on information systems, and the volume of data to be processed keeps growing. Cloud computing has moved from concept to practical application and has matured into public and private clouds that are customizable, elastically scalable, and service-oriented. Service quality matters greatly to a cloud platform, and monitoring is an essential component of it: monitoring is a prerequisite for network analysis, system management, job scheduling, load balancing, event prediction, fault detection, and recovery operations. It helps the platform quantify resource usage dynamically, detect service defects, discover user usage patterns, and support the decisions of the resource scheduling module, thereby improving service quality.

BC-PDM (Big Cloud of Parallel Data Mining) arose from the business intelligence requirements of the world's largest telecom operator and aims to provide efficient, accurate, and convenient data analysis services for massive data sets. The system is built on a Hadoop cluster, and this thesis describes the research and implementation of fault monitoring for that cluster. It first presents the research background and the state of the art, then, based on the project's requirements, gives the overall functional design and the design of each module. The work uses two open-source monitoring tools, Ganglia and Nagios. After an in-depth study of both tools, the thesis summarizes their working principles, strengths, and weaknesses, combines the strengths of the two, and improves Ganglia's fault-tolerance mechanism in order to implement fault monitoring and resource monitoring. Both Ganglia and Nagios have shortcomings in how they store monitoring data, so the system uses a persistence tool to transfer the monitoring data into a MySQL database, where it is managed and analyzed in one place.

The resulting system provides comprehensive monitoring of the physical, virtual, and service resources in the cloud platform and analyzes resource utilization. Based on that analysis, it raises fault alerts through multiple channels such as email and SMS, which helps keep the cloud platform running normally. Finally, a cloud platform monitoring system was implemented on the basis of this research, and its operation shows that the proposed approach is effective and feasible.
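
The alerting flow mentioned in the abstract (analyze resource utilization, then notify through channels such as email or SMS) can be illustrated with a minimal sketch. This is not the thesis's implementation. The metric names, the threshold values, and the `check_host` and `classify` helpers are all hypothetical. The only borrowed convention is the Nagios plugin status codes (0 = OK, 1 = WARNING, 2 = CRITICAL). Notification channels are modeled as plain callables so that an email or SMS sender could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Nagios plugin status convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


@dataclass(frozen=True)
class Threshold:
    warning: float   # utilization (%) at which a warning is raised
    critical: float  # utilization (%) at which the state becomes critical


# Hypothetical thresholds; the source does not state concrete values.
THRESHOLDS: Dict[str, Threshold] = {
    "cpu_util": Threshold(warning=80.0, critical=95.0),
    "mem_util": Threshold(warning=85.0, critical=95.0),
    "disk_util": Threshold(warning=80.0, critical=90.0),
}


def classify(metric: str, value: float) -> int:
    """Map one utilization sample to a Nagios-style status code."""
    threshold = THRESHOLDS[metric]
    if value >= threshold.critical:
        return CRITICAL
    if value >= threshold.warning:
        return WARNING
    return OK


def check_host(
    host: str,
    sample: Dict[str, float],
    notifiers: List[Callable[[str], None]],
) -> List[str]:
    """Check every metric of one host and send each alert to every channel."""
    alerts: List[str] = []
    for metric, value in sorted(sample.items()):
        state = classify(metric, value)
        if state != OK:
            message = f"{STATE_NAMES[state]}: {host} {metric}={value:.1f}%"
            alerts.append(message)
            for notify in notifiers:
                notify(message)  # e.g. an email or SMS sender
    return alerts


if __name__ == "__main__":
    # print stands in for a real notification channel.
    check_host(
        "node01",
        {"cpu_util": 97.2, "mem_util": 60.0, "disk_util": 82.5},
        [print],
    )
```

Keeping the channels as callables separates fault detection from delivery, so adding a new notification method does not touch the threshold logic.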
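[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.09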
