Research and Optimization of File Storage in HDFS
Published: 2018-06-26 12:41
Topics: cloud storage; Hadoop. Source: Master's thesis, Guangdong University of Technology, 2013.
【Abstract】: In recent years, cloud computing has been widely studied and applied and has rapidly become one of the hottest topics in computing. Cloud storage is a new concept extended from cloud computing, and the HDFS storage system of the Hadoop framework is its best-known implementation. Research shows that networks hold large amounts of duplicate data, whose redundant storage wastes considerable space. Moreover, small files are numerous and read/write requests frequent, and because every request is handled by the single NameNode in an HDFS cluster, overall system performance degrades sharply.

The thesis first gives a thorough analysis of the Hadoop system architecture and its implementation techniques, introduces data deduplication technology, and examines the shortcomings of HDFS in handling large numbers of small files, providing the theoretical basis for the subsequent research.

Building on the traditional HDFS architecture, this thesis proposes a new HDFS architecture and designs its metadata management and file operation workflows. Corresponding processing strategies are designed for the problems of massive duplicate data and small files in the network. The main contributions and innovations are as follows:

(1) A new HDFS architecture is proposed on the basis of traditional HDFS, adding one NameNode per rack to handle that rack's transactions. The metadata caching and recovery mechanisms of the main NameNode and the in-rack NameNodes are analyzed, and the metadata retrieval process for file operations is redesigned.

(2) For duplicate data, a dual-verification approach is adopted. A keyword extraction strategy is designed first; the extracted keywords are hashed, and duplicates are then confirmed by combining this with text similarity matching. This strategy avoids the drawbacks of fixed-length-chunk deduplication, makes duplicate detection more intelligent, and improves both the accuracy and the soundness of deduplication while saving storage space.

(3) For small files, a small-file merging scheme is adopted, and the metadata structure, cache contents, and update mechanism are analyzed accordingly; the read, write, and delete workflows for small files are also designed in detail. Merging small files saves system storage space, and since the in-rack NameNodes handle most requests within their own racks, the load on the main NameNode is effectively relieved, further improving system performance.

Finally, simulation experiments are carried out according to the design. The results show improvements, to varying degrees, in deduplication accuracy and soundness, small-file I/O speed, and NameNode memory and CPU usage, demonstrating the effectiveness of the design.
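To make point (1) above concrete, here is a minimal Python sketch of the two-level metadata lookup implied by the rack-local NameNode design. The class and method names (NameNode, RackAwareClient, get_block_locations) and the cache-on-miss behavior are illustrative assumptions, not the thesis's actual implementation, which also covers metadata caching and recovery in full.

```python
class NameNode:
    """Minimal stand-in for a NameNode's metadata table."""

    def __init__(self):
        self.metadata = {}  # path -> block locations

    def lookup(self, path):
        return self.metadata.get(path)


class RackAwareClient:
    """Consults the rack-local NameNode first, falling back to the main one."""

    def __init__(self, rack_namenode, main_namenode):
        self.rack_nn = rack_namenode
        self.main_nn = main_namenode

    def get_block_locations(self, path):
        # Serve the request inside the rack when possible, so the main
        # NameNode only sees cross-rack misses.
        meta = self.rack_nn.lookup(path)
        if meta is not None:
            return meta
        meta = self.main_nn.lookup(path)
        if meta is not None:
            # Assumption: cache the entry in the rack NameNode so later
            # requests for the same path stay local.
            self.rack_nn.metadata[path] = meta
        return meta
```

Under this scheme, most lookups terminate at the rack-local NameNode, which is what relieves the main NameNode of per-request load.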
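Point (2) describes a two-stage duplicate check: hash a set of extracted keywords, then confirm with text similarity matching. The following Python sketch illustrates that flow under simplifying assumptions: a frequency-based keyword extractor stands in for the thesis's extraction strategy, SHA-1 for its hash, and a cosine-similarity threshold of 0.9 (an arbitrary illustrative value) for its text-matching stage.

```python
import hashlib
import math
from collections import Counter


def extract_keywords(text, k=10):
    # Assumption: simple term frequency stands in for the thesis's
    # keyword extraction strategy.
    words = [w for w in text.lower().split() if len(w) > 2]
    return [w for w, _ in Counter(words).most_common(k)]


def keyword_fingerprint(keywords):
    # Stage 1: hash the sorted keyword set; identical fingerprints flag
    # a candidate duplicate.
    return hashlib.sha1(" ".join(sorted(keywords)).encode()).hexdigest()


def cosine_similarity(a, b):
    # Stage 2: vector-space-model similarity over term frequencies.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def is_duplicate(new_text, stored_text, threshold=0.9):
    # A file is judged duplicate only when both stages agree.
    fp_new = keyword_fingerprint(extract_keywords(new_text))
    fp_old = keyword_fingerprint(extract_keywords(stored_text))
    if fp_new != fp_old:
        return False
    return cosine_similarity(new_text, stored_text) >= threshold
```

The two-stage structure is the point: the cheap fingerprint filters candidates, and the similarity check guards against false positives that a hash alone would miss, which is the advantage claimed over fixed-length-chunk deduplication.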
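Point (3) relies on merging small files so that one container object plus an index replaces many per-file metadata entries. A minimal local-filesystem sketch of the idea follows; the JSON index format and the function names are assumptions for illustration, and a real implementation would operate on HDFS blocks and in-rack NameNode metadata rather than local paths.

```python
import json
from pathlib import Path


def merge_small_files(paths, container, index_path):
    # Concatenate small files into one container block and record each
    # file's (offset, length) so a read needs only one lookup and seek.
    index = {}
    offset = 0
    with open(container, "wb") as out:
        for p in paths:
            data = Path(p).read_bytes()
            index[Path(p).name] = {"offset": offset, "length": len(data)}
            out.write(data)
            offset += len(data)
    Path(index_path).write_text(json.dumps(index))


def read_small_file(container, index_path, name):
    # Look up the file in the index, then read exactly its byte range.
    index = json.loads(Path(index_path).read_text())
    entry = index[name]
    with open(container, "rb") as f:
        f.seek(entry["offset"])
        return f.read(entry["length"])
```

The saving comes from the metadata side: the NameNode tracks one container instead of thousands of tiny files, while the index keeps individual files addressable.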
【Degree-granting institution】: Guangdong University of Technology
【Degree level】: Master's
【Year degree conferred】: 2013
【CLC number】: TP333