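Research and Development of a Distributed Real-Time Computing Framework for Stream Data
Keywords: distributed stream computing; computation model; task scheduling; dynamic load balancing; time series prediction algorithm. Source: Zhejiang Sci-Tech University, 2013 master's thesis. Document type: degree thesis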
[Abstract]: With the development of large-scale data computing technology, data-processing applications have attracted wide attention, and data sources have become increasingly diverse in structure. Alongside traditional non-real-time, static structured data, there are now many unstructured data streams that are generated dynamically and in real time. The input rate, input volume, and origin of these continuously arriving unstructured data sequences change constantly and are hard to predict accurately. Faced with massive, highly variable data streams, traditional distributed computing models struggle to extract the important information the streams carry and to perform complex computation on them in real time. This motivates the in-depth study in this thesis of a new computing model: distributed real-time computation over stream data.

Research at home and abroad on distributed real-time stream computing frameworks is still at an early stage, and no mature product exists yet. Therefore, based on an in-depth analysis of the requirements of stream data processing applications, the author designed and implemented a complete distributed real-time stream computing framework, iStream, and studied and optimized load balancing, a key factor in the framework's performance. Experiments and performance tests show that the framework can be flexibly customized for actual application scenarios and has good real-time performance and scalability. The main research content and results of this thesis are as follows:

(1) Several key technologies in distributed computing frameworks are studied. Given the diversity of data stream forms and of data stream application scenarios, this thesis designs and implements iStream, a distributed real-time stream computing platform that is not tied to any particular scenario and can handle a variety of complex computations. It is highly general and extensible, and significantly improves the development efficiency of third-party developers.

(2) To increase throughput, strengthen data processing capacity, and improve the flexibility and availability of the compute-node cluster, dynamic scheduling techniques and load-balancing algorithms are studied. A time series prediction algorithm is proposed to address task scheduling in parallel computing, an NP-complete problem, and an improved AR model evaluation algorithm is used to handle non-stationary data sequences. This makes the program more efficient and the predictions more accurate, makes the approach applicable to data sources such as stream data that cannot be represented by a simple piecewise model, and preserves the performance of the dynamic load-balancing algorithm.

(3) Design and implementation of the system framework. Building on a study of mainstream parallel programming models such as MapReduce, an improved publish-subscribe model is applied in the iStream framework, and several mainstream distributed inter-process communication methods are analyzed and compared. This addresses key problems in distributed systems, including highly concurrent real-time processing, secure data communication, and adaptive adjustment. Taking the characteristics of stream computing into account, the design and implementation of each framework module improves on traditional distributed computing strategies, which improves the framework's security and significantly reduces latency.

(4) The applicable scenarios of distributed real-time computing frameworks are analyzed in depth, and the effect of iStream in commercial applications is examined through two case studies: a CTR-based performance advertising system and an online parameter optimization system. Finally, the thesis summarizes the work and outlines directions for future research.
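The load-balancing idea described in the abstract (forecast each node's load with a time series model, then schedule tasks using the forecast) can be illustrated with a minimal sketch. The abstract does not give the thesis's improved AR algorithm, so everything below is an assumption for illustration only: a plain least-squares AR(1) fit on first-differenced load samples (differencing being one common way to cope with a non-stationary series), and the hypothetical helper names `fit_ar1`, `forecast_next_load`, and `pick_node`. It is not iStream's API or the author's method.

```python
from typing import Dict, List, Sequence


def fit_ar1(series: Sequence[float]) -> float:
    """Least-squares AR(1) coefficient phi for x[t] ~ phi * x[t-1] (no intercept)."""
    num = sum(cur * prev for cur, prev in zip(series[1:], series[:-1]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den if den else 0.0


def forecast_next_load(history: Sequence[float]) -> float:
    """Predict the next load sample from a node's recent load history.

    Assumption: first-order differencing is used to remove a trend from the
    (possibly non-stationary) load series before fitting AR(1).
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    if len(history) < 3:
        # Too few samples to fit anything: fall back to the last observation.
        return float(history[-1])
    diffs = [b - a for a, b in zip(history[:-1], history[1:])]
    phi = fit_ar1(diffs)
    # Predicted next difference is phi * last difference; add it to the last level.
    return history[-1] + phi * diffs[-1]


def pick_node(histories: Dict[str, List[float]]) -> str:
    """Choose the node whose forecast load is lowest (hypothetical dispatch rule)."""
    return min(histories, key=lambda node: forecast_next_load(histories[node]))


if __name__ == "__main__":
    # Load in percent. Node "a" is currently lighter (50 vs 60) but rising,
    # node "b" is heavier but falling, so the forecast favors "b".
    loads = {"a": [20, 30, 40, 50], "b": [90, 80, 70, 60]}
    print(pick_node(loads))
```

In this toy input a scheduler that looks only at the current load would pick node "a", while the forecast-based rule picks "b". That difference is the motivation for predicting load instead of reacting to it, though a real system would need a higher-order model, a sliding window, and some guard against noisy samples.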
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP338.8
Article ID: 1495624
Article link: https://www.wllwen.com/wenyilunwen/guanggaoshejilunwen/1495624.html