Research on On-Demand Adjustment of HDFS Data Replicas and Their Placement Strategy

Published: 2018-06-16 13:33

Topic: cloud storage + data replicas; Reference: Lanzhou University of Technology, master's thesis, 2013

[Abstract]: The sustained, rapid development of information technology has placed unprecedented demands on data storage and on computation over data sets. Research institutions, governments, and enterprises all face high storage costs for massive data, difficult data management, high computational complexity, and low fault tolerance. Cloud storage emerged to address these problems. A cloud storage system treats data as its primary resource and provides the underlying data storage for cloud computing. It organizes the dispersed, heterogeneous, independent, and massive storage systems on a network into a reliable and secure logical whole under unified management, so that users receive efficient, highly reliable, and transparent service. Data replication is an indispensable data management technique in such systems. Based on an HDFS cloud storage cluster, this thesis studies the following aspects of data replication: determining the data block size, the conditions for creating replicas, the number of replicas to create, the conditions for deleting replicas, and the placement of replicas. The thesis makes two main contributions. First, it establishes a dynamic adjustment model for file data block size, together with a replica creation model and a replica deletion model. Second, it establishes a default model and a dynamic model for replica placement, and proposes a hierarchical rack node selection algorithm and a data node selection algorithm; in this model, the number of replicas can be adjusted dynamically as needed.
The quality of the block size strategy directly affects the allocation of Map/Reduce tasks, the management of file data blocks, and the performance of the network, so files must be split into blocks in a way that accounts for both the characteristics of the environment and the needs of users. Once a suitable block size has been chosen, the file data must be written to the cluster in light of the characteristics of the cloud storage system and user requirements. The cluster must also settle the degree of replica redundancy, that is, how many replicas to create for each file data block. Given the replica creation conditions, redundant replicas must in turn be deleted to improve the service efficiency of the cluster. For replica placement, the thesis aims to reduce and optimize the transfer of file data within the HDFS cloud storage cluster, thereby saving network bandwidth and improving the Map/Reduce performance of the HDFS cluster, and it divides replica placement into a default replica placement strategy and a dynamic replica placement strategy.
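
As a rough illustration of why the block size matters for Map/Reduce task allocation and why the replica count matters for storage cost, the following sketch computes how many blocks a file occupies and how much raw storage its replicas consume. It is not the thesis's model. It assumes the common Hadoop default of roughly one map task per block, and the function names `block_count` and `raw_storage_bytes` are hypothetical helpers introduced only for this example.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Return how many blocks are needed to hold a file of the given size.

    Assumption: with one input split per block, this is also roughly the
    number of map tasks launched for the file.
    """
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def raw_storage_bytes(file_size_bytes: int, replication: int) -> int:
    """Return the total bytes stored across the cluster for all replicas."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return file_size_bytes * replication


if __name__ == "__main__":
    file_size = 1024 * MIB  # a 1 GiB file
    for block_size in (64 * MIB, 128 * MIB):
        blocks = block_count(file_size, block_size)
        print(f"{block_size // MIB} MiB blocks -> {blocks} blocks")
    print(f"raw storage at 3 replicas: {raw_storage_bytes(file_size, 3) // MIB} MiB")
```

Doubling the block size halves the number of blocks (and, under the stated assumption, the number of map tasks), while each additional replica adds one full copy of the file to the cluster's raw storage. This is the trade-off that an adjustable block size and replica count would have to balance.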
[Degree-granting institution]: Lanzhou University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP333
