Research and Optimization of Small-File Processing Technology in Hadoop

Published: 2018-12-13 18:28

[Abstract]: With the rapid development of the Internet, digital information is growing exponentially, and humanity has entered the era of big data. Traditional methods of data storage and computation are increasingly at a disadvantage. How to store and process massive volumes of data efficiently and sensibly has become a central concern across industries in China and abroad. The concept of cloud computing arose from these demanding requirements on computation and storage, and as cloud computing technology has developed rapidly, storage and computation have become the most popular research areas. Hadoop, an open-source project of the Apache Foundation, performs outstandingly in distributed storage and distributed computing and has attracted wide attention at home and abroad; a growing number of universities and enterprises now use Hadoop to support their business needs. Although Hadoop was designed specifically for storing and processing big data, storing small files in it places heavy memory pressure on the master node, degrades file access efficiency, and indirectly reduces the computational efficiency of the MapReduce programming model.
Building on Hadoop's two core components, the MapReduce computing model and the HDFS distributed file system, this thesis focuses on general-purpose optimization of Hadoop-based small-file processing. To address the wasted NameNode memory, the slow file reads, and the low MapReduce efficiency that arise when Hadoop stores and processes small files, it first studies the small-file handling techniques built into Hadoop and analyzes their strengths and weaknesses in depth. It then studies and optimizes Hadoop at both the MapReduce level and the HDFS level to improve its efficiency in storing and processing small files.
At the MapReduce level, the thesis studies the MapReduce execution flow and the InputFormat architecture in depth and analyzes the MapReduce source code and the implementation of its internal methods in detail. By studying and implementing the CombineFileInputFormat abstract class, it merges small-file inputs at the MapReduce level, which improves Hadoop's efficiency in processing small files. At the HDFS level, it proposes a distributed file system with an independent small-file processing module. The module does not depend on HDFS and is cleanly decoupled from the Hadoop cluster, so neither affects the other. It merges small files, maintains an index mapping, and serves reads, and it adds a small-file cache to improve file access efficiency, which indirectly improves MapReduce efficiency on small files. Finally, experiments show that the custom CombineFileInputFormat processes MapReduce jobs more efficiently than the other input formats, and that the independent small-file processing module speeds up file access and reduces memory pressure on the master node.
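
The merge, index-mapping, read, and cache steps attributed to the small-file module can be illustrated with a minimal Java sketch. This is not the thesis's implementation: the class name `SmallFileStore`, its methods, the in-memory container, and the LRU eviction policy are all assumptions made for illustration, and the abstract does not specify the actual container format or cache policy.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: packs small files into one container, read via an index and an LRU cache. */
final class SmallFileStore {
    /** Where one small file lives inside the merged container. */
    private static final class Location {
        final int offset;
        final int length;
        Location(int offset, int length) { this.offset = offset; this.length = length; }
    }

    // Stands in for one large merged file; a real module would write to persistent storage.
    private final ByteArrayOutputStream container = new ByteArrayOutputStream();
    // Index mapping: small-file name -> (offset, length) in the container.
    private final Map<String, Location> index = new HashMap<>();
    // Small-file cache; LRU eviction is an assumption, not taken from the thesis.
    private final Map<String, byte[]> cache;
    private int cacheHits = 0;

    SmallFileStore(final int cacheCapacity) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > cacheCapacity;
            }
        };
    }

    /** Merge step: append the file to the container and record its location. */
    void put(String name, byte[] data) {
        if (index.containsKey(name)) {
            throw new IllegalArgumentException("duplicate file name: " + name);
        }
        index.put(name, new Location(container.size(), data.length));
        container.write(data, 0, data.length);
    }

    /** Read step: serve from the cache if possible, otherwise look up the index. */
    byte[] read(String name) {
        byte[] cached = cache.get(name);
        if (cached != null) {
            cacheHits++;
            return cached.clone();
        }
        Location loc = index.get(name);
        if (loc == null) {
            return null; // unknown file
        }
        // Copies the whole container for simplicity; a real module would seek to the offset.
        byte[] all = container.toByteArray();
        byte[] data = Arrays.copyOfRange(all, loc.offset, loc.offset + loc.length);
        cache.put(name, data);
        return data.clone();
    }

    int cacheHits() { return cacheHits; }

    int containerSize() { return container.size(); }

    public static void main(String[] args) {
        SmallFileStore store = new SmallFileStore(2);
        store.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        store.put("b.txt", "world".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.read("b.txt"), StandardCharsets.UTF_8));
        store.read("b.txt"); // second read is served from the cache
        System.out.println("cache hits: " + store.cacheHits());
    }
}
```

In this sketch the metadata kept per small file is a single (offset, length) pair in the module's own index, which is the general idea behind relieving the master node: many small files become one large stored object, and lookups go through the module instead of through per-file NameNode entries.
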
[Degree-granting institution]: Guangdong University of Technology
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13
