Research and Implementation of a Cloud Storage Platform Based on Hadoop

Published: 2018-09-04 11:51

[Abstract]: In recent years, cloud computing has become a focus of attention in China and abroad. When the core of what a cloud computing system computes and processes is the storage of large volumes of data, the system becomes a cloud storage system. With the rapid development of cloud computing, cloud storage has also become the hottest research area in the industry at present. As a new kind of service, cloud storage keeps users' data on cloud servers: once logged in to the cloud storage service over the Internet, a user can access that data from anywhere at any time without worrying that it will be lost.

Hadoop is an open-source distributed computing platform developed by Apache. Its excellent performance in distributed computing and data storage has attracted the attention of well-known IT companies in China and abroad, and many companies and research institutions have invested in studying it, so Hadoop is used more and more widely in cloud computing and cloud storage. HDFS, the distributed file system of Hadoop, has powerful data storage capabilities and is well suited to cloud storage systems. However, it has some design defects and its performance is not perfect, so it must be improved before it can be adopted on a large scale. This thesis studies a cloud storage model based on HDFS, improves HDFS with respect to two problems (poor handling of small files and unbalanced replica distribution), and uses the improved HDFS to build a cloud storage platform. The main work is as follows:

1. To ensure reliable data storage, HDFS uses a replica mechanism that stores copies of each file in the cluster. File replicas are stored as data blocks on different DataNodes. However, the default HDFS replica placement policy is random and cannot guarantee that replicas are distributed evenly across the cluster. To solve this problem, the thesis proposes an algorithm that, based on a weighted evaluation index matrix, selects the node that is nearest to the best solution and farthest from the worst solution. The weights are computed with the analytic hierarchy process (AHP). While node load is taken into account, the emphasis is on space utilization, and the most suitable DataNode is chosen to hold each replica so that the space load of the DataNodes is balanced overall.

2. HDFS is designed for large files and is not suited to storing large numbers of small files. For the same amount of data, small files waste NameNode memory and reduce access efficiency. To address this, the thesis modifies the HDFS file storage process: before a file is uploaded to the HDFS cluster it is checked, and if it is a small file it is merged with other small files, with the index information for each small file saved in an index file as key-value pairs. The improved scheme reduces the NameNode memory consumed by large numbers of small files and improves access efficiency.

3. Extensive experiments compare the original HDFS with the improved scheme. The results show that the proposed scheme performs better and improves the performance of HDFS. The improved Hadoop is used to build a storage cluster, and a Web application is developed that simulates a cloud storage platform in B/S (browser/server) mode and implements the related cloud storage functions.
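The replica placement idea in item 1 (pick the node nearest to the best solution and farthest from the worst, with AHP-derived weights) reads like the general TOPSIS ranking scheme. The sketch below illustrates that general scheme only; it is not the thesis's algorithm. The two criteria (space utilization and load, both treated as "lower is better"), the pairwise comparison values, the vector normalization, and the node names are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method."""
    size = len(pairwise)
    gmeans = [math.prod(row) ** (1.0 / size) for row in pairwise]
    total = sum(gmeans)
    return [g / total for g in gmeans]

def topsis_rank(nodes, weights):
    """Rank nodes by TOPSIS closeness; every criterion is a cost (lower is better).

    nodes: dict mapping node name -> list of criterion values.
    Returns a list of (name, closeness) pairs sorted best first.
    """
    names = list(nodes)
    ncrit = len(weights)
    # Vector-normalise each criterion column, then apply its weight.
    norms = [math.sqrt(sum(nodes[k][j] ** 2 for k in names)) or 1.0
             for j in range(ncrit)]
    v = {k: [weights[j] * nodes[k][j] / norms[j] for j in range(ncrit)]
         for k in names}
    # For cost criteria the ideal point is the column minimum,
    # and the anti-ideal point is the column maximum.
    best = [min(v[k][j] for k in names) for j in range(ncrit)]
    worst = [max(v[k][j] for k in names) for j in range(ncrit)]
    scores = []
    for k in names:
        d_best = math.dist(v[k], best)
        d_worst = math.dist(v[k], worst)
        denom = d_best + d_worst
        # Closeness near 1 means close to the ideal and far from the anti-ideal.
        scores.append((k, d_worst / denom if denom else 1.0))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Hypothetical judgement: space utilization is 3x as important as load.
weights = ahp_weights([[1.0, 3.0],
                       [1.0 / 3.0, 1.0]])

# Hypothetical DataNodes: [space utilization, load], both as fractions.
candidates = {
    "dn1": [0.80, 0.30],
    "dn2": [0.35, 0.50],
    "dn3": [0.60, 0.20],
}
ranking = topsis_rank(candidates, weights)
print(ranking[0][0])  # name of the node this sketch would pick
```

With these made-up numbers the emptiest node, dn2, ranks first even though its load is the highest of the three, which reflects the heavier weight placed on space utilization.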
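The small-file handling in item 2 (check the file before upload, merge small files, keep a key-value index) can be illustrated with a minimal in-memory sketch. It makes no HDFS calls and does not reproduce the thesis's implementation: the size threshold, the function names, and the choice of `(offset, length)` as the index value are assumptions for illustration.

```python
SMALL_FILE_LIMIT = 1024  # bytes; hypothetical threshold, not taken from the thesis

def pack_small_files(files, limit=SMALL_FILE_LIMIT):
    """Merge small files into one blob and leave large files untouched.

    files: dict mapping file name -> bytes.
    Returns (merged, index, large), where index[name] = (offset, length)
    locates each small file inside the merged blob.
    """
    merged = bytearray()
    index = {}
    large = {}
    for name, data in files.items():
        if len(data) < limit:
            # Record where this small file starts before appending it.
            index[name] = (len(merged), len(data))
            merged.extend(data)
        else:
            large[name] = data
    return bytes(merged), index, large

def read_small_file(merged, index, name):
    """Look up a small file by key and slice it out of the merged blob."""
    offset, length = index[name]
    return merged[offset:offset + length]

# Hypothetical upload batch: two small files and one large file.
uploads = {
    "a.txt": b"hello",
    "b.txt": b"world!",
    "big.bin": b"x" * 4096,
}
merged, index, large = pack_small_files(uploads)
print(index)
print(read_small_file(merged, index, "b.txt"))
```

In this sketch the merged blob would be stored as a single file, so the metadata cost is one entry for the blob plus the index file rather than one entry per small file, which is the kind of NameNode saving the abstract describes.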
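[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP333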

[References]