Research and Optimization of Small File Processing Techniques in Hadoop (Hadoop中小文件处理技术的研究与优化)
[Abstract]: With the rapid development of the Internet, traditional storage methods can no longer meet the demands of massive data access, and the storage and processing of massive data has become a new research topic. The distributed computing platform Hadoop is widely used in cloud computing because of its high reliability, easy scalability, and strong fault tolerance. However, Hadoop processes files in a streaming data-access pattern and is designed for storing large files, so it performs well on large files but stores and accesses small files inefficiently.

To address this problem, this paper analyzes existing research and improvement schemes, identifies their advantages and disadvantages, and makes corresponding improvements on that basis. The proposed design adds an independent small-file processing module on top of the original distributed file system. The module merges small files, builds a file index, and performs file caching and prefetching before the data is handed to HDFS for processing. This architecture allows HDFS to handle small files without affecting the reading or writing of large files or of already merged small files, thereby improving the storage and access efficiency of the system.

The merging and indexing scheme in this paper improves on HAR. A merged file is named after the time period in which its constituent small files were created. Based on each small file's name and extension, a Trie-tree index is built that maps the file to its containing data block and to its offset within that block; the index is further partitioned by extension, forming a two-level index mechanism. This index resides in the small-file processing module to speed up retrieval of small and medium-sized files. Prefetching combines the file's metadata and index information with the access records kept in the module's cache pool, and performs both index prefetching and prefetching of related files.

The paper describes the implementation of the optimization scheme on a Hadoop cluster, including the custom MapReduce input splits used for small-file merging, the construction of the two-level index, and other related algorithms. Performance metrics are defined to quantitatively analyze the memory efficiency and access efficiency of small-file storage. Finally, experiments compare the proposed small-file optimization scheme with the HAR scheme and the original HDFS. The results show that the proposed scheme outperforms both original HDFS and HAR in memory usage efficiency and access efficiency.
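To illustrate the merging step described in the abstract, the following is a minimal sketch of how a batch of small files could be concatenated into one HDFS container file while recording each file's offset and length for later lookup. The class and method names (SmallFileMerger, IndexEntry, merge) are illustrative assumptions, not the thesis's actual code; only standard Hadoop FileSystem/IOUtils calls are used.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SmallFileMerger {

    /** One index record: which container holds the file and where inside it. */
    public static class IndexEntry {
        public final String fileName;
        public final String container;
        public final long offset;
        public final long length;

        public IndexEntry(String fileName, String container, long offset, long length) {
            this.fileName = fileName;
            this.container = container;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Append every small file under srcDir into a single container file. */
    public static List<IndexEntry> merge(Configuration conf, Path srcDir, Path container)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        List<IndexEntry> index = new ArrayList<>();
        try (FSDataOutputStream out = fs.create(container)) {
            long offset = 0;
            for (FileStatus status : fs.listStatus(srcDir)) {
                if (status.isDirectory()) {
                    continue;                                  // only plain small files
                }
                long len = status.getLen();
                try (java.io.InputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, out, 4096, false);   // keep the container stream open
                }
                index.add(new IndexEntry(status.getPath().getName(),
                                         container.toString(), offset, len));
                offset += len;
            }
        }
        return index;
    }
}
```

In the thesis's scheme the container path would be named after the time window in which the small files were created; here that choice is simply left to the caller of merge().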
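The two-level index described in the abstract (partitioned by extension at the first level, with a Trie over file names at the second level) could look roughly like the sketch below, which stores the IndexEntry records produced by the merging sketch above. Again, TwoLevelIndex and TrieNode are assumed names for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class TwoLevelIndex {

    /** Trie node; a non-null entry marks the end of a complete file name. */
    private static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        SmallFileMerger.IndexEntry entry;   // null unless a file name ends here
    }

    // First level: one Trie root per extension ("jpg", "txt", ...).
    private final Map<String, TrieNode> roots = new HashMap<>();

    private static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase();
    }

    public void put(SmallFileMerger.IndexEntry entry) {
        TrieNode node = roots.computeIfAbsent(extensionOf(entry.fileName), k -> new TrieNode());
        for (char c : entry.fileName.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.entry = entry;
    }

    /** Returns the container/offset/length for a file name, or null if absent. */
    public SmallFileMerger.IndexEntry get(String fileName) {
        TrieNode node = roots.get(extensionOf(fileName));
        for (int i = 0; node != null && i < fileName.length(); i++) {
            node = node.children.get(fileName.charAt(i));
        }
        return node == null ? null : node.entry;
    }
}
```

Partitioning by extension first keeps each Trie small, so a lookup touches only the names that share the requested file's type before walking the Trie character by character.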
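Finally, reading a merged small file back and caching it in the small-file module's cache pool might be sketched as follows. The LRU cache stands in for the cache pool where a prefetch policy (for example, also loading files stored next to the requested one in the same container) would place its read-ahead results; SmallFileReader and its behavior are assumptions for illustration, not the thesis's implementation.

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class SmallFileReader {

    private final FileSystem fs;
    private final TwoLevelIndex index;
    private final Map<String, byte[]> cache;

    public SmallFileReader(FileSystem fs, TwoLevelIndex index, int cacheCapacity) {
        this.fs = fs;
        this.index = index;
        // Access-ordered LinkedHashMap gives a simple LRU eviction policy.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > cacheCapacity;
            }
        };
    }

    /** Return the file's bytes, serving from the cache pool when possible. */
    public byte[] read(String fileName) throws IOException {
        byte[] cached = cache.get(fileName);
        if (cached != null) {
            return cached;                       // cache hit, no HDFS access
        }
        SmallFileMerger.IndexEntry e = index.get(fileName);
        if (e == null) {
            return null;                         // not a merged small file
        }
        byte[] buf = new byte[(int) e.length];
        try (FSDataInputStream in = fs.open(new Path(e.container))) {
            in.readFully(e.offset, buf);         // positioned read at the recorded offset
        }
        cache.put(fileName, buf);
        return buf;
    }
}
```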
【Degree-Granting Institution】: Hebei University (河北大学)
【Degree Level】: Master's
【Year of Degree Conferral】: 2016
【CLC Classification Number】: TP333