Analysis and Application of Distributed File System Technology Based on Hadoop
Published: 2018-07-17 07:47
【Abstract】: With the rapid growth of the Internet (chiefly the mobile Internet) and the emerging Internet of Things, we live in an era of exploding data. IDC estimates that 1.8 ZB of data was generated and created worldwide in 2011, and that the global volume of information doubles every two years. Such volumes pose enormous challenges for data storage and management; IDC further reports that the growth of global storage capacity has fallen far behind the growth of data itself. Storing this much data on a single device is impractical with current storage technology, and concentrating it on one device would also hamper later analysis. Spreading data across many storage devices is therefore the preferred approach to storing massive data today, but it requires a distributed file system to manage those devices, coordinate their work, and offer users good data-access performance. The Hadoop Distributed File System (HDFS), modeled on the Google File System (GFS), is a strong answer to this need. It is open source and free, has been deployed on clusters of many nodes with remarkable results, and offers high fault tolerance, high reliability, high scalability, and high throughput. These properties provide a safe storage environment for massive data and greatly ease the processing of very large data sets. HDFS also integrates well with the MapReduce programming model and gives applications high-throughput data access.

This thesis first surveys, in chronological order, the representative distributed file systems of each era and their characteristics, and then analyzes the architecture and operating principles of HDFS in detail. Building on a study of HDFS high availability, it combines the advantages of the BackupNode and AvatarNode schemes to design a highly available distributed file system, which we call HADFS. HADFS not only provides a hot-standby NameNode but also fails over to the standby automatically when the active NameNode crashes, without users noticing the switch. Finally, with HDFS as the storage layer, we design a cloud disk system supporting file upload, download, folder creation, and file deletion. The system is built on the SSH (Struts/Spring/Hibernate) framework and uses the WebDAV protocol to transfer data to and from HDFS, cleanly decoupling the cloud disk's front end from the underlying storage.
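The transparent NameNode failover described above can be sketched in a few lines. This is a minimal illustrative model, not the thesis's actual HADFS implementation: the heartbeat timeout, the polling approach, and all class and method names are assumptions made for the example.

```python
import time

class NameNode:
    """Stand-in for a NameNode process, tracked by its last heartbeat."""
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

class FailoverController:
    """Routes clients to the standby when the active node misses heartbeats.

    The 3-second timeout is an illustrative assumption; clients always call
    current_active() and so never observe the switch directly."""
    TIMEOUT = 3.0

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def current_active(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.active.last_heartbeat > self.TIMEOUT:
            # Active is presumed dead: promote the hot standby in place.
            self.active, self.standby = self.standby, self.active
        return self.active

primary = NameNode("nn-primary")
backup = NameNode("nn-backup")
ctl = FailoverController(primary, backup)
print(ctl.current_active().name)           # healthy: primary serves
stale = time.monotonic() + 10              # simulate missed heartbeats
print(ctl.current_active(now=stale).name)  # failover: backup now serves
```

In a real deployment the thesis's design must also replicate namespace state to the standby (the role of BackupNode/AvatarNode); this sketch models only the switchover decision.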
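The cloud disk's four operations map naturally onto WebDAV verbs (PUT, GET, MKCOL, DELETE). The sketch below shows that mapping against an in-memory dictionary standing in for HDFS; the class and its storage layout are assumptions for illustration, not the thesis's code.

```python
class CloudDisk:
    """Maps cloud-disk operations to WebDAV-style methods.

    A dict replaces the HDFS backend: path -> file bytes, with None
    marking a directory entry."""
    def __init__(self):
        self.store = {"/": None}

    def mkcol(self, path):      # WebDAV MKCOL: create a folder
        self.store[path] = None

    def put(self, path, data):  # WebDAV PUT: upload a file
        self.store[path] = data

    def get(self, path):        # WebDAV GET: download a file
        return self.store[path]

    def delete(self, path):     # WebDAV DELETE: remove a file or folder tree
        doomed = [p for p in self.store
                  if p == path or p.startswith(path + "/")]
        for p in doomed:
            del self.store[p]

disk = CloudDisk()
disk.mkcol("/docs")
disk.put("/docs/a.txt", b"hello")
print(disk.get("/docs/a.txt"))      # b'hello'
disk.delete("/docs")                # removes the folder and its contents
print("/docs/a.txt" in disk.store)  # False
```

Because the front end speaks only WebDAV, the backing store can be swapped (local disk, HDFS) without touching the interface, which is the decoupling the thesis's design aims for.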
【Degree-granting institution】: Wuhan University of Technology
【Degree level】: Master's
【Year conferred】: 2013
【CLC classification】: TP333; TP316.4
Article ID: 2129658
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2129658.html