Research and Application of Monitoring Technology for Large-Scale Distributed Systems

Published: 2018-01-19 08:29

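Keywords: distributed system monitoring; call chain monitoring; sampling; fault diagnosis; aggregation operations　Source: Zhejiang University, 2017 master's thesis　Document type: degree thesis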

[Abstract]: Distributed systems are the main form taken by computer applications of ever-growing scale and complexity. Distributed tracing systems and distributed performance monitoring systems are important means for large distributed systems to diagnose anomalies, monitor performance, and keep the system stable: the tracing system monitors the calls between the services of a distributed system, and the performance monitoring system monitors the resource consumption of each of its components. Distributed systems suffer from several problems: errors are hard to locate quickly and accurately, the monitoring data collected is of low value, and collecting and querying monitoring data consumes substantial resources. Addressing monitoring-data sampling, data analysis, and the storage and indexing of monitoring data, this thesis proposes schemes for rapid anomaly diagnosis and for reducing the resource cost of collecting and querying monitoring data. The specific work is as follows. 1. A posterior call chain collection scheme is proposed. In existing large-scale distributed systems the proportion of abnormal call chains is very small. The scheme therefore has each node pre-judge whether a call is abnormal, and reconstructs and stores only the faulty call chains. Compared with the fixed sampling rate used by traditional distributed monitoring and tracing systems, this raises the value of the stored call monitoring log data and reduces network and storage consumption. 2. A call chain fault diagnosis method based on decision tree classification is proposed to address the difficulty of quickly and accurately locating the cause of an error in a distributed system. The method extracts features from a dataset of known abnormal call chains and classifies faulty call chains into different error types, so that the cause of an error can be located quickly. 3. A time-series data indexing method based on a hashed summary forest is proposed, which reduces the resource and time cost of statistical and aggregate queries over long time ranges when the volume of monitoring data is very large. The method uses a summary-forest tree index to speed up aggregation over time-series data, and combines it with an HBase-based hashed segment tree storage scheme to solve the hotspot problem that arises when HBase stores time-series data in a distributed manner. Building on these points, the thesis constructs the Qiantang distributed tracing system (JTang Tracer), which traces and analyzes call chains in distributed systems and visualizes the call data. Compared with traditional distributed monitoring systems, it saves more resources and collects more valuable data.
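The posterior collection idea in the first contribution can be illustrated with a minimal Python sketch. This is not the thesis's implementation, which the abstract does not describe in detail: the `Span` fields, the `is_abnormal` rule with its 1000 ms latency threshold, and the `PosteriorCollector` class are all hypothetical, and a single in-process buffer stands in for the distributed nodes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """One call within a call chain (hypothetical minimal schema)."""
    trace_id: str
    service: str
    duration_ms: float
    error: bool = False


def is_abnormal(span, latency_threshold_ms=1000.0):
    # Assumed node-local pre-judgment: an explicit error or a slow call.
    # The abstract does not state the actual rule or threshold.
    return span.error or span.duration_ms > latency_threshold_ms


class PosteriorCollector:
    """Buffers spans per trace and keeps a trace only if a span was flagged."""

    def __init__(self):
        self._buffer = defaultdict(list)  # trace_id -> spans seen so far
        self._flagged = set()             # trace_ids with an abnormal span
        self.stored = {}                  # trace_id -> reconstructed chain

    def record(self, span):
        self._buffer[span.trace_id].append(span)
        if is_abnormal(span):
            self._flagged.add(span.trace_id)

    def finish(self, trace_id):
        """Called when a trace completes; returns True if it was stored."""
        spans = self._buffer.pop(trace_id, [])
        if trace_id in self._flagged:
            self._flagged.discard(trace_id)
            self.stored[trace_id] = spans
            return True
        return False  # normal trace: buffered spans are simply dropped


collector = PosteriorCollector()
collector.record(Span("t1", "gateway", 12.0))
collector.record(Span("t1", "orders", 8.0))
collector.record(Span("t2", "gateway", 15.0))
collector.record(Span("t2", "payments", 30.0, error=True))
kept = [collector.finish("t1"), collector.finish("t2")]
```

In this sketch the decision to keep a trace is made after its spans have been observed, in contrast to fixed-rate sampling, which decides before the outcome of the call is known. Only trace `t2`, which contains a failed call, ends up in `collector.stored`.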
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP277
