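Research and Application of Monitoring and Alerting Technology for a Large-Scale Cluster Management Platform
Published: 2018-04-14 07:03
Topics: Hadoop cluster + monitoring; Source: Beijing University of Posts and Telecommunications, master's thesis, 2016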
[Abstract]: In recent years, with the arrival of the information age, data volumes have grown explosively, and ordinary large computers can no longer handle the computation of massive data sets. Major Internet companies have therefore adopted large-scale Hadoop clusters for data storage and analysis. As Hadoop clusters grow, keeping them running stably has become a key concern. To monitor a cluster's operating status in real time and notify administrators promptly when problems arise, a monitoring and alerting tool deployed on the cluster is needed. The tool should provide a visual front-end interface, support automated deployment, and allow monitoring and alerting metrics to be extended at any time, so that administrators can maintain the cluster effectively and keep it working normally. The thesis first introduces the research background and significance of the topic, surveys existing cluster monitoring and alerting software in China and abroad, and explains the characteristics and practical value of the tool studied here. Second, it covers the fundamentals of the distributed cluster that hosts the tool, the three key problems considered in designing the monitoring and management system, and a preliminary account of the corresponding solutions and design ideas. It then describes the overall architecture of the monitoring and alerting platform in detail, shows the platform's operating workflow, and introduces its component modules. The analysis focuses on the system's two main modules, the collection module and the alerting module: starting from the collection module's structure diagram, it details how that module works; it then presents the structure and workflow of the alerting module, including the metrics monitored for alerts. Next, the collected performance metrics are extracted from the database and plotted to analyze the cluster's operating status. Finally, the thesis summarizes the completed work, identifies the system's shortcomings, and outlines areas for future improvement.
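[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP277;TP311.52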