Research and Optimization of Small File Processing Technology in Hadoop
[Abstract]: With the rapid development of the Internet and the exponential growth of digital information, society has entered the era of big data. Traditional methods of storing and processing data no longer hold an advantage, and how to store massive volumes of data efficiently and sensibly has become a focus for industries at home and abroad. Cloud computing emerged in response to this demand for computation and storage, and as it has matured, distributed storage and computing have become some of the most active research areas. Hadoop, an open source project of the Apache Software Foundation, performs well in both distributed storage and distributed computing, and a growing number of universities and enterprises use it to support their business needs. Hadoop is designed to store and process large data sets, however, and it handles small files poorly: storing many of them puts heavy memory pressure on the master node (the NameNode), slows file access, and indirectly reduces the computational efficiency of the MapReduce programming model. Building on Hadoop's MapReduce computing model and its distributed file system, HDFS, this thesis studies general-purpose optimizations for small file processing in Hadoop. To address the wasted NameNode memory, the slow file reads, and the low MapReduce efficiency that arise when Hadoop stores and processes small files, it first examines the small file processing techniques that Hadoop itself provides and analyzes their advantages and disadvantages in depth. It then studies and optimizes Hadoop at both the MapReduce level and the HDFS level to improve the efficiency with which Hadoop stores and processes small files.
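The memory pressure that small files put on the NameNode can be made concrete with a back-of-envelope estimate. The sketch below is not taken from the thesis: it assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (one object per file plus one per block) and a 128 MB block size, and the class and method names are hypothetical.

```java
/**
 * Rough estimate of NameNode heap used by file and block metadata.
 * ASSUMPTION: about 150 bytes per namespace object, a widely quoted
 * rule of thumb and not a figure from the thesis.
 */
final class NameNodeMemoryEstimate {
    static final long BYTES_PER_OBJECT = 150L;

    /** Number of blocks a file of the given size occupies (0 for an empty file). */
    static long blocksFor(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        // Ceiling division: a partially filled block still needs its own entry.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    /** Estimated metadata bytes for fileCount files that all have the same size. */
    static long estimateBytes(long fileCount, long fileSizeBytes, long blockSizeBytes) {
        long objectsPerFile = 1 + blocksFor(fileSizeBytes, blockSizeBytes);
        return fileCount * objectsPerFile * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed block size
        long fileCount = 10_000_000L;        // illustrative workload
        long fileSize = 100L * 1024;         // 100 KB per small file

        long smallFiles = estimateBytes(fileCount, fileSize, blockSize);

        // The same data volume packed into block-sized merged files.
        long mergedCount = blocksFor(fileCount * fileSize, blockSize);
        long mergedFiles = estimateBytes(mergedCount, blockSize, blockSize);

        System.out.println("small files : " + smallFiles + " bytes of metadata");
        System.out.println("merged files: " + mergedFiles + " bytes of metadata");
    }
}
```

Under these assumptions, ten million 100 KB files need about 3 GB of metadata, while the same data packed into block-sized files needs about 2.3 MB. That gap is the motivation for merging small files before they reach the NameNode.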
At the MapReduce level, the thesis studies the MapReduce execution process and the InputFormat architecture in depth, and analyzes the MapReduce source code and the implementation of its internal methods in detail. By studying and implementing the abstract class CombineFileInputFormat, it merges small files at the input-format stage of MapReduce, which improves the efficiency of processing small files in Hadoop. At the HDFS level, the thesis proposes a distributed file system design with an independent small file processing module. The module does not depend on HDFS, so it is decoupled from the Hadoop cluster and the two do not affect each other. It merges small files, builds an index mapping for them, and reads them back; an added small file cache further improves file access efficiency and indirectly improves the efficiency of MapReduce when processing small files. Finally, the experimental results show that MapReduce jobs using the custom CombineFileInputFormat run more efficiently than those using other input formats, and that the independent small file processing module speeds up file access and reduces memory pressure on the master node.
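The merge, index-mapping, and cache steps described for the small file module can be illustrated with a minimal in-memory sketch. This is not the thesis's implementation: the class SmallFileStore, its methods, and the choice of an LRU cache are assumptions made for illustration, and a real module would write the merged container to distributed storage instead of a byte array.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: merged container + name -> (offset, length) index + LRU cache. */
final class SmallFileStore {
    private final ByteArrayOutputStream merged = new ByteArrayOutputStream();
    private final Map<String, int[]> index = new HashMap<>();
    private final Map<String, byte[]> cache;
    private byte[] snapshot; // lazily refreshed copy of the merged bytes
    int cacheHits = 0;

    SmallFileStore(final int cacheCapacity) {
        // An access-ordered LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > cacheCapacity;
            }
        };
    }

    /** Appends a small file to the merged container and records where it landed. */
    void put(String name, byte[] content) {
        index.put(name, new int[] {merged.size(), content.length});
        merged.write(content, 0, content.length);
        snapshot = null;    // the container changed, so the old snapshot is stale
        cache.remove(name); // drop any cached copy of an overwritten name
    }

    /** Returns the file's bytes from the cache or the container, or null if unknown. */
    byte[] read(String name) {
        byte[] cached = cache.get(name);
        if (cached != null) {
            cacheHits++;
            return cached;
        }
        int[] entry = index.get(name);
        if (entry == null) {
            return null;
        }
        if (snapshot == null) {
            snapshot = merged.toByteArray();
        }
        byte[] data = Arrays.copyOfRange(snapshot, entry[0], entry[0] + entry[1]);
        cache.put(name, data);
        return data;
    }

    int mergedSize() {
        return merged.size();
    }

    public static void main(String[] args) {
        SmallFileStore store = new SmallFileStore(2);
        store.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        store.put("b.txt", "world!".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(store.read("b.txt"), StandardCharsets.UTF_8));
        store.read("b.txt"); // second read is served from the cache
        System.out.println("merged bytes: " + store.mergedSize()
                + ", cache hits: " + store.cacheHits);
    }
}
```

Because many small files share one container, the file system sees a single large object rather than one entry per small file, and the index makes each original file addressable by offset and length. The cache means repeated reads of the same file skip the container lookup altogether.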
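【Degree-granting institution】: Guangdong University of Technology (广东工业大学)
【Degree level】: Master's
【Year degree awarded】: 2016
【Classification number】: TP311.13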
【Author】: 赵菲 (Zhao Fei)
Article ID: 2377017
Article link: https://www.wllwen.com/kejilunwen/ruanjiangongchenglunwen/2377017.html