Research and Implementation of D-Stream, a Stream Processing System Supporting Dynamic Task Topology and Load Diversion

Published: 2018-12-13 16:36

[Abstract]: The arrival of the big data era has brought major changes and challenges to real-time data processing. Against this background, D-Stream, the stream processing subsystem of the D-Ocean unstructured data management system, provides a general, reliable, and scalable distributed computing framework for applications built on real-time processing of massive data. D-Stream is implemented on a general-purpose stream processing framework design and draws on ideas from open-source stream processing platforms such as S4 and Storm. Its functional structure has three main parts. The first is a concise, open task model, through which applications can dynamically customize task topologies according to their needs. The second is a reliable, stable stream processing engine that ensures data is transferred quickly and transparently between computing tasks, so that different tasks can cooperate efficiently. The third is a highly available scheduling framework that schedules computing resources effectively to make full use of the cluster's computing capacity, and that provides an efficient load diversion mechanism when messages back up. Around these three issues, this thesis describes how to model real-world applications with the D-Stream task model. On the implementation side, it introduces the D-Stream component architecture and the symmetric scheduling framework, focuses on the algorithms and design patterns used in implementing the stream processing engine, and gives a comprehensive evaluation of the system's characteristics. Finally, through the D-Ocean CBIR application case, the thesis verifies the advantages of D-Stream in real-time applications over massive data.
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP338.8
