Research on Technologies for Optimizing File Storage in HDFS

Published: 2018-06-24 15:04

Topic: Hadoop Distributed File System (HDFS) + storage node selection; Source: master's thesis, Nanjing Normal University, 2013

[Abstract]: To cope with ever-growing volumes of data, the computing field has proposed a new computing model, cloud computing. Hadoop is an open-source framework for large-scale distributed computing; its high throughput, high reliability, and high scalability have made it widely used in cloud computing. HDFS, the distributed file system in Hadoop, is designed to run on commodity hardware. It is highly fault-tolerant and can be deployed on inexpensive machines. HDFS provides high-throughput data access, is well suited to applications on large data sets, and allows data in the file system to be read as a stream. As a distributed file system still under active development, however, HDFS inevitably has some shortcomings in how it stores file data. For example, when storing data replicas, HDFS selects Datanodes on a rack at random, which can leave Datanode load unbalanced and degrade the performance of the whole system. In addition, HDFS was originally designed for streaming storage of large files and is not optimized for small files, so its performance on small files is very poor. This thesis first gives a brief introduction to the development of distributed file systems, then analyzes HDFS in depth, including its architecture, metadata management, and file read and write flows, and examines the performance and shortcomings of existing solutions for HDFS data storage and small-file storage. The main contributions of the thesis are as follows:
1. To address the load imbalance that can arise when Datanodes on a rack are selected at random for replica storage, the thesis proposes a method that applies multi-objective optimization, based on the current running state of each Datanode, to find the Datanode with the best overall conditions for storing the data. This method distributes replicas evenly across Datanodes and can also improve read and write performance.
2. Real applications produce large numbers of small files. To address the weakness of HDFS in storing small files, the thesis proposes two strategies: merging small files and caching small files on the Client side. The Client merges small files into several large files and stores the large files in HDFS together with the related metadata. When a small file is read, the Client caches the entire large file containing it, as returned by the Datanode; when that small file, or another small file in the same large file, is read again, it can be served directly from the Client. This reduces the number of metadata requests the Client sends to the Namenode and the number of data block requests it sends to Datanodes, greatly reducing the access time for small files.
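The abstract does not say which Datanode metrics or which multi-objective method the thesis uses. As a minimal sketch only, the idea of ranking Datanodes by their current running state can be illustrated with a weighted-sum scalarization. The metrics (`free_disk_ratio`, `cpu_load`, `active_transfers`), the weights, and the names `DatanodeState`, `score`, and `choose_datanodes` are all assumptions made for illustration and are not taken from the thesis or from the HDFS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DatanodeState:
    """Hypothetical snapshot of one Datanode's running state."""
    name: str
    free_disk_ratio: float  # 0..1, higher is better
    cpu_load: float         # 0..1, lower is better
    active_transfers: int   # open data transfers, lower is better


# Assumed weights and cap; a real system would have to tune these.
WEIGHTS = {"disk": 0.5, "cpu": 0.3, "net": 0.2}
MAX_TRANSFERS = 100


def score(node: DatanodeState) -> float:
    """Fold the three objectives into one number (weighted sum)."""
    transfer_ratio = min(node.active_transfers, MAX_TRANSFERS) / MAX_TRANSFERS
    return (WEIGHTS["disk"] * node.free_disk_ratio
            + WEIGHTS["cpu"] * (1.0 - node.cpu_load)
            + WEIGHTS["net"] * (1.0 - transfer_ratio))


def choose_datanodes(candidates: List[DatanodeState],
                     replicas: int) -> List[DatanodeState]:
    """Return the `replicas` best-scoring candidates, best first."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:replicas]


nodes = [
    DatanodeState("dn1", free_disk_ratio=0.9, cpu_load=0.1, active_transfers=5),
    DatanodeState("dn2", free_disk_ratio=0.2, cpu_load=0.9, active_transfers=80),
    DatanodeState("dn3", free_disk_ratio=0.6, cpu_load=0.3, active_transfers=10),
]
chosen = [n.name for n in choose_datanodes(nodes, replicas=2)]
```

A weighted sum is only one way to handle several objectives; it was chosen here because it is short, and it stands in for whatever technique the thesis actually applies.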
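The small-file strategy can be sketched in a few lines under simplifying assumptions: the "metadata" is taken to be an index mapping each small file name to its offset and length inside the merged file, and HDFS is replaced by an in-memory stand-in. `merge_small_files`, `FakeHdfs`, and `CachingClient` are hypothetical names for illustration; none of them is part of the HDFS client API, and the thesis may lay out its metadata differently.

```python
from typing import Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # small file name -> (offset, length)


def merge_small_files(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate small files and record where each one starts."""
    blob = bytearray()
    index: Index = {}
    for name, data in files.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


class FakeHdfs:
    """In-memory stand-in for HDFS that counts remote fetches."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[bytes, Index]] = {}
        self.fetch_count = 0

    def put(self, merged_name: str, blob: bytes, index: Index) -> None:
        self._store[merged_name] = (blob, index)

    def get(self, merged_name: str) -> Tuple[bytes, Index]:
        self.fetch_count += 1  # one simulated Namenode + Datanode round trip
        return self._store[merged_name]


class CachingClient:
    """Caches whole merged files so repeated reads stay on the client."""

    def __init__(self, hdfs: FakeHdfs) -> None:
        self._hdfs = hdfs
        self._cache: Dict[str, Tuple[bytes, Index]] = {}

    def read(self, merged_name: str, small_name: str) -> bytes:
        if merged_name not in self._cache:
            self._cache[merged_name] = self._hdfs.get(merged_name)
        blob, index = self._cache[merged_name]
        offset, length = index[small_name]
        return blob[offset:offset + length]


hdfs = FakeHdfs()
blob, index = merge_small_files({"a.txt": b"hello", "b.txt": b"world!"})
hdfs.put("merged-0001", blob, index)

client = CachingClient(hdfs)
first = client.read("merged-0001", "a.txt")   # fetches the merged file
second = client.read("merged-0001", "b.txt")  # served from the client cache
```

Here the second read never reaches the store, which is the saving the abstract describes. The sketch ignores cache eviction and invalidation, which a real client would need.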
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP316.4; TP333
