Research and Implementation of a Storm-Based Real-Time Big Data Analysis System

Published: 2018-06-24 16:04

Topic: Storm + real-time computing; Source: Shanghai Jiao Tong University, master's thesis, 2015

【Abstract】: Real-time computing technologies such as Storm and Spark are currently a research hotspot in big data processing. This thesis is set against the background of a provincial transportation and logistics cloud computing platform project undertaken by the author's laboratory. The platform comprises a batch-based big data analysis service and a real-time data processing service built on the stream computing system Storm. In practice, however, Storm still has several problems: the round-robin allocation strategy of the default scheduler leads to load imbalance among worker nodes; the default scheduler's single scheduling strategy cannot meet flexible and changing business requirements; and the Nimbus control node is a single point of failure, which can easily cause task submission and assignment in a Storm cluster to fail. To address these problems, this thesis analyzes the real-time data processing requirements of the transportation and logistics cloud computing platform and, building on a study of Storm and related technologies, designs and implements a Storm-based real-time big data analysis system. The system provides real-time data analysis and processing services for the SaaS applications of logistics enterprises, and resolves the uneven assignment of tasks to worker nodes by the default scheduler, the single scheduling strategy, and the Nimbus single point of failure. Testing and deployment show that the system is feasible and effective. Compared with similar systems, this work has the following characteristics:
1) To improve system performance, and to address the uneven task assignment across worker nodes and the single scheduling strategy of Storm's default scheduler, the thesis proposes the RBS (Resource Based Schedule) task scheduling algorithm, which is based on node resource monitoring, and the SNS (Single Node Schedule) task scheduling algorithm, which supports single-node placement. Corresponding Topology task schedulers are designed and implemented on top of the two algorithms. Experiments show that the RBS-based scheduler can place worker processes on nodes with lower resource utilization according to the resource usage of the worker nodes, while the SNS-based scheduler can place the multiple worker processes of a Topology that performs only simple computation and keeps little intermediate state on a single physical node.
2) To improve system availability, the thesis proposes a solution to the single point of failure of Storm's control node. The solution uses the Zookeeper coordination service to elect the master control node and to synchronize state between the master and standby control nodes. Experiments show that, in a control node cluster of three nodes, when the master control node goes down one of the standby control nodes is successfully elected as the new master, so that Topology tasks keep running without interruption.
3) Building on the above, a Storm-based real-time big data analysis system is designed and implemented to provide real-time big data analysis services for the SaaS applications of logistics enterprises. The system consists of a stream computing application development environment and a stream computing application runtime environment. The runtime environment comprises two parts. The first is the Storm-based task runtime environment for stream computing applications, which includes the input stream component, the Topology task scheduler based on the Ganglia monitoring service, the control node cluster coordinator based on the Zookeeper coordination service, and the persistent output component. The second is the data input/output service runtime environment, which includes the data collection and preprocessing component, the Kafka middleware, and a NoSQL database. The development environment comprises an integrated development tool, a testing tool, and a deployment tool: the integrated development tool is based on Eclipse and provides application developers with API libraries for the data collection and preprocessing component, the input stream component, the persistent output component, and so on; the testing tool wraps a standalone (single-machine) version of Storm to provide a simulated runtime environment for stream computing applications.
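The abstract contrasts three placement strategies: round-robin assignment, resource-based scheduling (RBS), and single-node scheduling (SNS). It does not give the algorithms themselves, so the following Python sketch only illustrates the general idea. Everything in it is an assumption made for illustration: the function names, the integer CPU-utilization percentages standing in for Ganglia metrics, and the `cost_per_worker` estimate. It does not use Storm's API; in Storm, a custom scheduler would be plugged in through its pluggable scheduler interface in Java.

```python
from itertools import cycle


def round_robin_assign(num_workers, nodes):
    """Hand workers to nodes in turn, ignoring current load."""
    assignment = {node: 0 for node in nodes}
    for _, node in zip(range(num_workers), cycle(nodes)):
        assignment[node] += 1
    return assignment


def resource_based_assign(num_workers, utilization, cost_per_worker=10):
    """Greedy RBS-style placement (hypothetical sketch).

    utilization maps node name -> current utilization in percent.
    Each worker goes to the node with the lowest estimated utilization,
    and that estimate is then raised by cost_per_worker (an assumed value).
    """
    load = dict(utilization)  # copy so the caller's metrics are not mutated
    assignment = {node: 0 for node in load}
    for _ in range(num_workers):
        target = min(load, key=load.get)
        assignment[target] += 1
        load[target] += cost_per_worker
    return assignment


def single_node_assign(num_workers, utilization):
    """SNS-style placement (hypothetical sketch): all workers on the least loaded node."""
    target = min(utilization, key=utilization.get)
    return {node: (num_workers if node == target else 0) for node in utilization}


if __name__ == "__main__":
    # Made-up utilization figures, purely for demonstration.
    metrics = {"node-a": 80, "node-b": 20, "node-c": 45}
    print("round robin:", round_robin_assign(4, list(metrics)))
    print("RBS sketch: ", resource_based_assign(4, metrics))
    print("SNS sketch: ", single_node_assign(4, metrics))
```

With these made-up figures, round-robin still sends workers to the busiest node, whereas the resource-based variant steers them toward the idle ones. The single-node variant trades balance for locality, which matches the abstract's description of Topologies that do simple computation with little intermediate state.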
【Degree-granting institution】: Shanghai Jiao Tong University
【Degree level】: Master's
【Year conferred】: 2015
【Classification number】: TP311.52

【Similar literature】