Research on Replica Strategies for the HDFS Distributed Parallel File System
Published: 2018-08-02 12:10
[Abstract]: In recent years, global data volume has grown rapidly alongside advances in science and technology. In particular, the emergence of Web 2.0, with its emphasis on user interaction, changed users from mere readers of the Internet into creators of its content. In this environment of massive information, traditional storage systems can no longer keep pace with the growth of data and face capacity and performance bottlenecks, such as limits on the number of hard disks and servers.
HDFS (Hadoop Distributed File System) differs from traditional distributed parallel file systems: it is a new distributed file system that runs on inexpensive commodity machines and offers high throughput, high fault tolerance, and high reliability. It provides distributed data storage and management together with high-performance data access and interaction.
In HDFS, replicas are a core component of the system. Replica techniques coordinate the resources of nodes across the network to carry out demanding workloads efficiently; this is achieved through replica placement, replica selection, and replica adjustment, which improve the effective transfer of data among nodes.
This thesis first analyzes the state of research on replica management strategies, summarizing prior work in the field and its limitations. On that basis, it analyzes and explains key HDFS technologies, including the system architecture and the read/write mechanism, and builds a dynamic replica management model for HDFS, discussed from two aspects: replica placement and replica deletion. It then designs an algorithm following this improved placement idea, proposing a replica placement strategy based on distance and load information and introducing a balance factor that adjusts the relative weight of distance and load to meet the requirements of different users. In addition, to serve the replica adjustment phase, it improves the replica deletion strategy by introducing a replica evaluation function, yielding a value-based replica deletion strategy. Finally, simulation experiments validate the effectiveness of the proposed strategies and compare them with the default HDFS replica strategy.
The main contributions of this thesis are as follows:
1) It analyzes the differences between the HDFS distributed parallel file system and traditional distributed systems, with an emphasis on a comparison with GFS: it examines the design ideas and principles of the two systems and compares the similarities and differences of their replica management strategies, showing that HDFS is a simplified design derived from GFS with more flexible operation.
2) It proposes a replica placement strategy based on distance and load information. This strategy replaces the random placement algorithm of the default HDFS policy: it jointly considers replica size, transmission bandwidth, and node load to compute a utility value for each node, preferentially storing data blocks on nodes with higher utility, and introduces a balance factor so that different users can tune the system to their performance requirements. Simulation experiments show that the proposed algorithm achieves clearly better load balancing than the default HDFS placement strategy.
3) It proposes a value-based replica deletion strategy. When a new replica write request arrives, the Namenode obtains a random set of Datanodes and selects one node to write the data. If the selected node already holds too many replicas and is heavily loaded, its performance cannot be used effectively; the default HDFS replica adjustment strategy does not account for this. The improved strategy computes each replica's value with a value-evaluation function and sorts the replicas accordingly; when a node becomes overloaded, the least valuable replica is deleted, freeing space and making full use of the node. Experiments show that in large-file write tests the proposed strategy outperforms the default HDFS strategy.
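The utility-based placement idea in contribution 2 can be sketched as follows. The thesis states only that replica size, transmission bandwidth, and node load are combined via a balance factor into a per-node utility; the concrete formula, field names, and normalization below are assumptions for illustration, not the thesis's actual equation.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    bandwidth_mbps: float  # link bandwidth from the writer to this Datanode
    load: float            # current load, normalized to [0, 1]

def utility(node: Node, replica_size_mb: float, alpha: float) -> float:
    """Utility of placing a replica of the given size on `node`.

    alpha is the balance factor: alpha near 1 favors short transfer
    time (distance/bandwidth), alpha near 0 favors lightly loaded
    nodes. The formula itself is a hypothetical stand-in.
    """
    transfer_cost = replica_size_mb / node.bandwidth_mbps  # seconds to ship the block
    distance_score = 1.0 / (1.0 + transfer_cost)           # higher = faster transfer
    load_score = 1.0 - node.load                           # higher = idler node
    return alpha * distance_score + (1.0 - alpha) * load_score

def place_replica(nodes, replica_size_mb, alpha=0.5):
    """Pick the node with the highest utility value, as the strategy prefers."""
    return max(nodes, key=lambda n: utility(n, replica_size_mb, alpha))

nodes = [
    Node("dn1", bandwidth_mbps=100.0, load=0.9),  # fast link, heavily loaded
    Node("dn2", bandwidth_mbps=50.0, load=0.2),   # moderate link, lightly loaded
    Node("dn3", bandwidth_mbps=10.0, load=0.1),   # slow link, nearly idle
]
best = place_replica(nodes, replica_size_mb=64.0, alpha=0.5)
print(best.name)  # dn2: the best trade-off between transfer time and load
```

With `alpha=0.5` the balanced node wins; raising `alpha` toward 1 would instead favor the fast but busy `dn1`, which is exactly the user-tunable trade-off the balance factor is meant to expose.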
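The value-based deletion idea in contribution 3 can be sketched in the same spirit. The thesis does not spell out its value-evaluation function here, so the scoring below (favoring frequently accessed, recent, small replicas) and the replica attributes are illustrative assumptions; only the overall mechanism, score every replica, sort, and evict the least valuable when the node is overloaded, comes from the abstract.

```python
def replica_value(accesses: int, age_s: float, size_mb: float) -> float:
    """Hypothetical value-evaluation function: a replica is worth more
    if it is accessed often and is recent; larger replicas free more
    space when deleted, so size lowers the score."""
    return (1.0 + accesses) / ((1.0 + age_s) * size_mb)

def select_victim(replicas):
    """When a Datanode is overloaded, pick the least-valuable replica
    for deletion (the strategy sorts by value and evicts the minimum)."""
    return min(
        replicas,
        key=lambda r: replica_value(r["accesses"], r["age_s"], r["size_mb"]),
    )

replicas = [
    {"block": "blk_1", "accesses": 40, "age_s": 600.0,   "size_mb": 64.0},
    {"block": "blk_2", "accesses": 2,  "age_s": 86400.0, "size_mb": 128.0},
    {"block": "blk_3", "accesses": 15, "age_s": 3600.0,  "size_mb": 64.0},
]
print(select_victim(replicas)["block"])  # blk_2: old, rarely read, and large
```

Deleting `blk_2` frees the most space at the least cost to readers, which is the effect the abstract attributes to the improved strategy in the large-file write tests.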
【Degree-granting institution】: Zhejiang Normal University (浙江师范大学)
【Degree level】: Master's
【Year conferred】: 2013
【Classification】: TP316.4; TP333
Article ID: 2159391
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2159391.html