Optimization and Improvement of HDFS-Based Small-File Handling and of the Performance of the Related MapReduce Computing Model
Posted: 2018-06-06 04:10
Topics: Hadoop + HDFS; Source: master's thesis, Jilin University, 2012
【Abstract】: With the rapid development of the Internet, data is growing explosively, and traditional technical architectures are increasingly unable to meet the demands of today's massive data volumes. The processing and storage of massive data has therefore become a major focus of current research. Drawing on Google's papers, Doug Cutting and others developed the distributed computing platform Hadoop to perform index computations for large-scale search engines. Although Hadoop itself was designed to process large, streamed files, its ever-wider adoption means that it is now used for heavy computation across many industries and domains, which broadens its requirements. Handling small files has become a bottleneck of the Hadoop platform.
Focusing on small-file handling on the Hadoop platform, this thesis studies the existing solutions and proposes its own approach. A small file is a file whose size is smaller than the HDFS block size (typically 64 MB). Large numbers of small files severely hurt Hadoop's performance and scalability, for three main reasons. First, in HDFS every block, file, and directory is kept in memory as an object of roughly 150 bytes; with 10 million small files the namenode needs about 2 GB of memory (storing two copies), and with 100 million small files it needs about 20 GB. Small files thus consume a large share of the namenode's memory, and the namenode's memory capacity in turn severely constrains cluster growth and use. Second, accessing many small files is far slower than accessing a few large files: HDFS was originally built for streaming access to large files, and reading many small files forces constant hops from one datanode to another, which seriously degrades performance. Finally, processing many small files is far slower than processing large files of the same total size: every small file occupies a slot, and a large fraction of the time, sometimes most of it, is spent starting and releasing tasks.
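As a rough, back-of-the-envelope check on the memory figures above (an estimate not taken from the thesis, counting only one ~150-byte namenode object per file; block objects and the duplicated copy mentioned in the abstract push the totals toward the quoted 2 GB and 20 GB):

```latex
\[
  10^{7}\ \text{files} \times 150\ \text{B/object} \approx 1.5\ \text{GB},
  \qquad
  10^{8}\ \text{files} \times 150\ \text{B/object} \approx 15\ \text{GB}.
\]
```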
To manage the storage of small files, this thesis builds HSF, a new file system layered on top of HDFS. HSF classifies small files and handles each class differently. Files that are intrinsically tiny (for example, images) are merged into a SequenceFile container, and an efficient index is built so that users can still randomly access the original small files; files that can simply be concatenated are merged directly, again with an index for random access. The index for random access to small files is a two-level index whose keys are hash values, combined with a caching mechanism that keeps index tables in memory as appropriate; when several small files inside the same merged file are accessed, random-access efficiency improves, mitigating the problems that small files cause for a Hadoop system.
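To make the merge-and-index scheme concrete, the sketch below shows a minimal version of the idea, assuming a SequenceFile container keyed by the original file name and a flat in-memory offset table. The class name SmallFileMerger is hypothetical, and the thesis's actual HSF uses a two-level hashed index with caching rather than this simple map; the classic FileSystem-based SequenceFile API of the Hadoop 1.x era is used.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Illustrative sketch (not the thesis's HSF code): merge local small files into
 * one SequenceFile on HDFS and keep an in-memory index of record offsets so a
 * single small file can later be read back without scanning the container.
 */
public class SmallFileMerger {

    /** original file name -> byte offset of its record in the merged file */
    private final Map<String, Long> index = new HashMap<String, Long>();

    public void merge(File[] smallFiles, Path merged, Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, merged, Text.class, BytesWritable.class);
        try {
            for (File f : smallFiles) {
                // Record where this entry starts so a reader can seek straight to it.
                index.put(f.getName(), writer.getLength());
                byte[] data = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        } finally {
            writer.close();
        }
    }

    /** Random access to one original small file via the cached offset. */
    public byte[] read(String name, Path merged, Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, merged, conf);
        try {
            reader.seek(index.get(name));      // jump to the record boundary
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            reader.next(key, value);
            return Arrays.copyOf(value.getBytes(), value.getLength());
        } finally {
            reader.close();
        }
    }
}
```

Recording record offsets at write time and seeking to them at read time is the same trick Hadoop's own MapFile uses; HSF layers a two-level hash-keyed index and an in-memory cache of index tables on top of this idea so that lookups stay cheap even when many merged containers are involved.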
The experimental part of the thesis tests the proposed file system systematically, building different test cases on different kinds of data. One set of experiments reads small files directly and reads small files after merging, treating binary image files and text files separately; it shows that read times from both the local file system and HDFS grow linearly, so growing data volumes do not disrupt normal operation. A second experiment uses the WordCount example program shipped with MapReduce to compare merged files against unmerged small files on text data, confirming that the proposed file system works well with the MapReduce computing model. A third experiment reads small files at random, again separately for binary images and text, and shows that the system's random access to small files is efficient, outperforming Hadoop's own HAR archive file system.
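For the WordCount comparison, a merged container can be fed to MapReduce directly through SequenceFileInputFormat, so the job sees a few large splits instead of one split per small file. The driver below is a hedged sketch against the org.apache.hadoop.mapreduce API (newer than the Hadoop version the 2012 experiments would have used); the class names MergedWordCountDriver and MergedFileMapper are hypothetical, and the thesis's actual experiment code is not given in the abstract.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

/** Illustrative WordCount-style driver that reads the merged SequenceFile,
 *  where each record is one original small text file (name -> contents). */
public class MergedWordCountDriver {

    public static class MergedFileMapper
            extends Mapper<Text, BytesWritable, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Text fileName, BytesWritable contents, Context context)
                throws IOException, InterruptedException {
            // Treat the record value as the text of one original small file.
            String text = new String(contents.getBytes(), 0, contents.getLength(), "UTF-8");
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // same shuffle/reduce as standard WordCount
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJobName("wordcount-on-merged-small-files");
        job.setJarByClass(MergedWordCountDriver.class);
        job.setMapperClass(MergedFileMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // One merged container yields a few large splits instead of one map task
        // per small file, which is where the start-up cost saving comes from.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

SequenceFileInputFormat splits the container on block boundaries, so the merged data is still processed in parallel while the per-task startup cost highlighted in the abstract is paid far less often.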
【Degree-granting institution】: Jilin University
【Degree level】: Master's
【Year degree awarded】: 2012
【Classification number】: TP338.8
【Citing literature】
Related master's theses (4 items)
1. Zhao Shaofeng. Research on Key Technologies of Cloud Storage Systems [D]. Zhengzhou University, 2013.
2. Dai Wanneng. Research and Implementation of Inverted-Index Technology on the Hadoop Platform [D]. University of Electronic Science and Technology of China, 2013.
3. Zhang Xing. Research and Implementation of a Hadoop-Based Cloud Storage Platform [D]. University of Electronic Science and Technology of China, 2013.
4. Zhang Dan. Research on Technologies for Optimizing File Storage in HDFS [D]. Nanjing Normal University, 2013.
Article ID: 1985012
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1985012.html