Research and Optimization of Small File Processing Technology in Hadoop

Published: 2019-05-03 18:45

[Abstract]: With the rapid growth of the Internet, traditional storage methods can no longer meet the demands of accessing massive data, and the storage and processing of massive data has become a new research topic. The distributed computing platform Hadoop is highly reliable, easily scalable, and highly fault-tolerant, and it has been widely adopted in cloud computing. Because Hadoop processes files using a streaming data access model, it is in effect designed for storing large files. As a result, Hadoop performs well on large files but suffers from low storage efficiency on small files. To address this problem, this thesis analyzes earlier research and improvement schemes, identifies their strengths and weaknesses, and makes corresponding improvements on that basis.

The proposed design adds an independent small file processing module on top of the original distributed file system. The module merges small files, builds a file index, and applies file caching and prefetching before passing the data to HDFS for processing. This architecture lets HDFS handle small files without affecting reads or writes of large files or of already merged small files, thereby improving the system's storage and access efficiency. The small file merging and indexing scheme improves on HAR. Merged files are named after the time period in which their small files were created. In addition, a Trie index is built from each small file's name and extension, mapping the small file to its data block and to its address within that block. The index is partitioned by extension, which yields a two-level index mechanism, and it resides in the small file processing module to speed up small file retrieval. Prefetching uses file metadata, index information, and prefetch records to prefetch indexes and related files into the cache pool of the small file processing module.

The thesis presents the concrete implementation of this optimization scheme on a Hadoop cluster, including the custom MapReduce input split used for small file merging and the algorithms for building the two-level index. Performance metrics are defined to quantify memory usage efficiency and access efficiency for small files. Finally, experiments compare the optimized scheme, the HAR scheme, and the original HDFS scheme when processing small files. The experimental results show that the proposed scheme outperforms both the original HDFS scheme and the HAR scheme in memory usage efficiency and access efficiency.
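The merging and two-level indexing idea described in the abstract can be illustrated with a minimal in-memory sketch. This is not the thesis's implementation: the class names `Location`, `TrieNode`, and `TwoLevelIndex`, the function `merge_small_files`, and the merged-file name `merged_20160101_0000` are all hypothetical, and the sketch assumes the first level shards by file extension while the second level is a character trie over the base name. It runs without Hadoop and stores the merged data in a plain byte string.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Location:
    merged_file: str  # name of the merged file that holds the small file
    offset: int       # byte offset of the small file inside the merged file
    length: int       # size of the small file in bytes


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    location: Optional[Location] = None  # set only where a file name ends


class TwoLevelIndex:
    """Level 1: one shard per extension. Level 2: a trie over the base name."""

    def __init__(self) -> None:
        self.shards: Dict[str, TrieNode] = {}

    @staticmethod
    def _split(filename: str) -> Tuple[str, str]:
        stem, ext = os.path.splitext(filename)
        return stem, ext.lstrip(".").lower()

    def insert(self, filename: str, location: Location) -> None:
        stem, ext = self._split(filename)
        node = self.shards.setdefault(ext, TrieNode())
        for ch in stem:
            node = node.children.setdefault(ch, TrieNode())
        node.location = location

    def lookup(self, filename: str) -> Optional[Location]:
        stem, ext = self._split(filename)
        node = self.shards.get(ext)
        for ch in stem:
            if node is None:
                return None
            node = node.children.get(ch)
        return node.location if node is not None else None


def merge_small_files(files: Dict[str, bytes], merged_name: str,
                      index: TwoLevelIndex) -> bytes:
    """Concatenate small files and record each one's offset in the index."""
    blob = bytearray()
    for name, data in files.items():
        index.insert(name, Location(merged_name, len(blob), len(data)))
        blob.extend(data)
    return bytes(blob)


if __name__ == "__main__":
    index = TwoLevelIndex()
    # Hypothetical time-window name for the merged file.
    blob = merge_small_files(
        {"a.txt": b"hello", "ab.txt": b"world!", "a.log": b"xyz"},
        "merged_20160101_0000",
        index,
    )
    loc = index.lookup("ab.txt")
    print(blob[loc.offset:loc.offset + loc.length])  # b'world!'
```

Sharding by extension first keeps each trie small and lets a lookup skip every file of a different type. In a real deployment the merged bytes would live in HDFS blocks rather than in memory.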
[Degree-granting institution]: Hebei University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP333

[References]