Research on Optimization Methods for the Storage Performance of Small-File Data in the Hadoop Distributed File System

Published: 2018-02-27 01:16

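Keywords: memory consumption; secondary index; small-file merging; hot storage　Source: Beijing Jiaotong University, 2017 master's thesis　Type: degree thesis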

[Abstract]: Society has entered the era of big data, and efficient data storage and retrieval have become a focus of attention. Hadoop performs well when storing large data sets, but the recent spread of social applications such as blogs, Wikipedia, and personal-space services has produced small files in very large numbers, which poses a serious challenge for storage. Because the Hadoop Distributed File System (HDFS) relies on a single Namenode, it stores small files inefficiently and the Namenode easily becomes a bottleneck. This thesis proposes a new scheme for storing small files on HDFS and tests its feasibility. The research was supported by the National Natural Science Foundation of China (No. 61271308, 61172072, 61401015), a graduate discipline construction project of the Beijing Municipal Education Commission, and a project of the Chengdu Survey, Design and Research Institute of the Power Construction Corporation of China. The main work is as follows. First, the thesis analyzes the characteristics and problems of HDFS: when a single Namenode stores a massive number of small files, it generates a large amount of metadata, and Namenode memory consumption becomes excessive. The thesis therefore merges small files into large files. After merging, however, reading a small file requires a second index lookup to locate it inside the merged file, which reduces read efficiency to some degree. To compensate, the thesis introduces secondary-index metadata together with prefetching and caching mechanisms to improve small-file read efficiency. Based on this analysis, the thesis proposes an extended HDFS framework that adds a data processing layer between the user layer and the data storage layer. This layer performs small-file merging, file prefetching, and caching, and thereby improves small-file storage performance. The extended framework uses three algorithms. A file-type-based small-file merging algorithm classifies large numbers of small files by file extension and merges each class into large files, which effectively reduces Namenode memory consumption. A file-type-based secondary-index algorithm for merged-file metadata speeds up reading of the mapping file that records which small files belong to which merged file, and so improves the overall read efficiency of the system. A hot-storage algorithm based on dynamic frequency statistics keeps the merged files read most often within a given period in the prefetch-and-cache component; when a user requests a file held there, the corresponding small file can be read directly without contacting the Namenode, which also improves read efficiency. Finally, the thesis builds a pseudo-distributed Hadoop platform and compares the original HDFS storage structure, HAR archive files, and the improved HDFS storage structure on three measures: Namenode memory consumption, file write efficiency, and file read efficiency. The experimental results show that the improved structure reduces file write efficiency to some extent, but effectively lowers Namenode memory consumption and improves small-file read efficiency, so it offers better storage performance than the original small-file storage schemes.
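The abstract names the three algorithms but does not give their details, so the following is only a minimal Python sketch of the general idea, not the thesis's implementation. It assumes in-memory dictionaries in place of HDFS and the Namenode, and every name in it (`merge_by_extension`, `read_small_file`, `HotCache`) is hypothetical. It shows merging by file extension, a two-step lookup (file type, then offset and length within the merged file), and a frequency-counted cache of the most-read merged files.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def _file_type(name):
    """Classify a file by its lower-cased extension (assumed grouping key)."""
    return PurePosixPath(name).suffix.lower() or "<none>"


def merge_by_extension(small_files):
    """Concatenate small files of the same type into one merged blob.

    small_files: dict mapping file name -> bytes.
    Returns (merged, index): merged maps file type -> merged bytes, and
    index maps file type -> {file name: (offset, length)}.
    """
    merged = defaultdict(bytearray)
    index = defaultdict(dict)
    for name in sorted(small_files):
        data = small_files[name]
        ftype = _file_type(name)
        # Record where this file starts before appending its bytes.
        index[ftype][name] = (len(merged[ftype]), len(data))
        merged[ftype] += data
    return {t: bytes(buf) for t, buf in merged.items()}, dict(index)


def read_small_file(name, merged, index):
    """Two-step lookup: file type first, then (offset, length) in that type."""
    ftype = _file_type(name)
    offset, length = index[ftype][name]
    return merged[ftype][offset:offset + length]


class HotCache:
    """Holds the merged files read most often during the current window."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hits = Counter()
        self.cached = {}

    def record_read(self, ftype):
        self.hits[ftype] += 1

    def refresh(self, merged):
        """Re-select the hottest merged files, then start a new window."""
        hottest = [t for t, _ in self.hits.most_common(self.capacity)]
        self.cached = {t: merged[t] for t in hottest}
        self.hits.clear()
```

In this sketch the Namenode-side saving comes from tracking one merged object per file type instead of one object per small file; the per-type index plays the role of the secondary index, and `HotCache.refresh` stands in for the periodic frequency statistics.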
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP333;TP311.13

[References]