Research on Data Stream Processing Technology Based on Cloud Storage

Published: 2018-01-03 15:00

Source: Wuhan University of Technology, master's thesis, 2013　Document type: degree thesis

Related topics: cloud storage; data stream processing; HDFS; Map/Reduce

[Abstract]: Since Google put forward the concept of cloud computing in 2006, cloud computing has grown from a concept widely dismissed by the industry as hype into an increasingly mature form of technical service. Among the many service types that cloud computing offers, storage is the one users consume most directly, and it has developed into an independent field of research; many major IT companies are now positioning themselves in cloud storage. Cloud storage was born for the big-data era, and how to store, manage and process massive data more efficiently, quickly and securely remains an active research topic. On the back end of cloud storage, Hadoop, regarded as the open-source technology best suited to big-data processing, is being widely studied and used. Because Hadoop is relatively young, it still has some design defects, and cloud storage providers often need to modify it according to their type of service and the actual conditions of their data centers in order to provide better service. This thesis studies Hadoop as the key technology for data stream processing in cloud storage; Hadoop processes data as streams. Based on a study of the architecture and execution flow of HDFS, the distributed file system at the core of the Hadoop platform, the thesis addresses the design defect of its single master node, the NameNode, and proposes a method for decomposing the load on the master node. Within an acceptable range of performance loss, the method modifies the system architecture to reduce the access pressure on the single master node, so that the system as a whole can serve more access requests. It also lowers the risk of instability, or even a crash, when a single node is overloaded, which improves the robustness of the system. In addition, the thesis designs a secondary backup of HDFS metadata, which further improves the reliability of the system.
The thesis also studies the data stream processing mechanism of Map/Reduce, the other core component of Hadoop. To address its excessive resource consumption, the thesis proposes an optimization that effectively reduces the resource consumption of Map/Reduce in certain specific cases. For these cases, the data structure of the metadata is improved so that Map/Reduce can obtain metadata from HDFS before it starts processing the data stream, locate the relevant data blocks precisely, and filter out unnecessary data processing. This strengthens the support HDFS gives to Map/Reduce, reduces resource consumption during data processing, and avoids wasted resources. Finally, the optimized system is compared with the original architecture through repeated experiments. The experimental data show that the improved system achieves the expected results in balancing resource consumption and load pressure. This work was supported by the National Natural Science Foundation of China (Grant No. 60970064).
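The idea of consulting metadata before processing, so that only relevant blocks are read, can be illustrated with a small sketch. This is not the thesis's implementation and does not use any Hadoop API: the BlockMeta structure, its min_key/max_key fields, the block IDs and the sample data are all hypothetical, chosen only to show how a per-block key range kept in metadata could let a job skip blocks that cannot contain matching records.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical metadata entry; the key-range fields are an assumption."""
    block_id: str
    min_key: int
    max_key: int


def select_blocks(metadata: List[BlockMeta], lo: int, hi: int) -> List[BlockMeta]:
    """Keep only blocks whose [min_key, max_key] overlaps the query range [lo, hi]."""
    return [m for m in metadata if m.max_key >= lo and m.min_key <= hi]


def count_in_range(blocks: Dict[str, List[int]], metadata: List[BlockMeta],
                   lo: int, hi: int) -> Tuple[int, int]:
    """Toy map/reduce: count keys in [lo, hi], reading only the selected blocks.

    Returns (matching record count, number of blocks actually read).
    """
    selected = select_blocks(metadata, lo, hi)
    # "Map" phase: one partial count per selected block.
    partials = [sum(1 for k in blocks[m.block_id] if lo <= k <= hi) for m in selected]
    # "Reduce" phase: combine the partial counts.
    return sum(partials), len(selected)


# Made-up sample data: three blocks of integer keys.
BLOCKS = {"blk_1": [1, 5, 9], "blk_2": [10, 14, 19], "blk_3": [20, 25, 29]}
METADATA = [BlockMeta(b, min(ks), max(ks)) for b, ks in BLOCKS.items()]

total, blocks_read = count_in_range(BLOCKS, METADATA, 12, 22)
print(total, blocks_read)
```

In this toy run the query range 12 to 22 overlaps only blk_2 and blk_3, so blk_1 is never scanned. Whether such filtering pays off in practice depends on how well the data is clustered by key, which matches the abstract's caveat that the optimization applies only in specific cases.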
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP333
