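Research on Data Stream Processing Technology Based on Cloud Storage
Source: Wuhan University of Technology, master's thesis, 2013. Document type: degree thesis.
Keywords: cloud storage, data stream processing, HDFS, Map/Reduce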
[Abstract]: Since Google introduced the concept of cloud computing in 2006, cloud computing has grown from a notion that the industry widely dismissed as hype into an increasingly mature form of technology service. Among the many services that cloud computing offers, storage is the one users work with most directly, and it has developed into an independent field of research; many major IT companies are now positioning themselves in cloud storage. Cloud storage was born for the era of big data, and how to store, manage, and process massive data more efficiently, quickly, and securely remains an active research topic. On the back end of cloud storage, Hadoop, as the open-source technology best suited to big data processing, is being widely studied and used. However, Hadoop has not been around for long and still has some design defects, and cloud storage providers often need to modify it to suit their service types and the actual conditions of their data centers in order to provide better service.
This thesis studies Hadoop, a key technology for data stream processing in cloud storage, which processes data as streams. After examining the architecture and execution flow of HDFS, the distributed file system at the core of the Hadoop platform, the thesis addresses the design defect of its single master node, the NameNode, and proposes a method for decomposing the load on the master node. Within an acceptable range of performance loss, the method reduces the access pressure on the single master node by modifying the system architecture, so that the system as a whole can serve more access requests. It also lowers the risk that an overloaded single node becomes unstable or even crashes, which improves the robustness of the system. In addition, the thesis designs a secondary backup of HDFS metadata to further improve the reliability of the system.
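The abstract does not say how the load on the NameNode is decomposed. As a purely illustrative sketch, and not the design used in the thesis, one way to spread metadata requests is to hash each file path to one of several metadata nodes. The `MetadataRouter` class, the node names, and the choice of MD5 as the hash are all assumptions made for this example.

```python
import hashlib
from collections import Counter


class MetadataRouter:
    """Hypothetical router that spreads metadata lookups over several nodes."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one metadata node is required")
        self.nodes = list(nodes)

    def node_for(self, path):
        # Hash the full path so the same file always maps to the same node.
        digest = hashlib.md5(path.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.nodes)
        return self.nodes[index]


def load_per_node(router, paths):
    """Count how many of the given paths each metadata node would serve."""
    return Counter(router.node_for(p) for p in paths)
```

Calling `load_per_node(MetadataRouter(["nn-0", "nn-1", "nn-2"]), paths)` on a list of file paths shows how requests that would otherwise all reach one master node are divided among three. A scheme like this pays a small cost per request for the hash and the extra routing step in exchange for a lower peak load on any single node, which is the kind of trade-off the abstract describes.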
The thesis also studies the data stream processing mechanism of Map/Reduce, the other core component of Hadoop. To address its excessive resource consumption, the thesis proposes an optimization that can effectively reduce the resource consumption of Map/Reduce in certain specific cases. For these cases, the metadata data structure is modified so that Map/Reduce can first obtain metadata from HDFS before processing the data stream, locate the relevant data blocks precisely, and skip unnecessary data processing. This strengthens HDFS support for Map/Reduce, reduces resource consumption during data processing, and avoids wasted resources.
Finally, repeated experiments compare data processing in the optimized system with that in the original architecture. The experimental data show that the improved system achieves the expected results in balancing resource consumption and load pressure.
This work was supported by the National Natural Science Foundation of China (Grant No. 60970064).
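The metadata-driven filtering described in the abstract can be illustrated with a small sketch. It assumes, hypothetically, that the metadata for each block records the minimum and maximum key stored in that block; the abstract does not state what the modified metadata actually holds. `BlockMeta`, `select_blocks`, and `count_matches` are names invented for this example, and a plain dictionary stands in for HDFS block storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical per-block metadata: the key range stored in the block."""
    block_id: str
    min_key: int
    max_key: int


def select_blocks(blocks, lo, hi):
    """Keep only blocks whose key range overlaps the query range [lo, hi]."""
    return [b for b in blocks if b.max_key >= lo and b.min_key <= hi]


def count_matches(block_store, blocks, lo, hi):
    """Count records with lo <= key <= hi, reading only the selected blocks.

    Returns (number of matching records, number of blocks actually read).
    """
    selected = select_blocks(blocks, lo, hi)
    # Map step: emit 1 for every matching record in a selected block.
    mapped = (
        1
        for b in selected
        for key in block_store[b.block_id]
        if lo <= key <= hi
    )
    # Reduce step: sum the emitted values.
    return sum(mapped), len(selected)
```

With three blocks covering keys 0 to 99, 100 to 199, and 200 to 299, a query for keys 150 to 220 reads only the last two blocks. The first block is excluded using metadata alone, before any map task would be started on it.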
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP333