Research and Implementation of an Optimized Small-File Storage Strategy for HDFS
Published: 2018-08-30 13:23
【Abstract】: With the rapid growth of Internet data, storing and processing massive data has become one of the biggest challenges of the big-data era. A variety of cloud storage systems have emerged, and companies at home and abroad have invested in researching and developing their own. HDFS is a distributed file system built as an open-source implementation of Google's GFS; it is designed for storing massive data and offers high reliability, high availability, and high scalability. An HDFS cluster uses a master-slave architecture: a single central node stores the file system's metadata, while many data nodes store the actual data. Large files are split into blocks, which are placed on data nodes and distributed across the cluster. When HDFS is applied to workloads containing large numbers of small files, the central node's memory is consumed rapidly, which limits the cluster's capacity and subjects the central node to flooding-query pressure.

This thesis studies HDFS's built-in solutions for small-file storage, which merge and compress small files on the remote (server) side; because these solutions involve a multi-level index lookup, their read and write performance is poor. To address their shortcomings, a client-side small-file merging strategy is proposed. In this scheme, small files are buffered and merged on the client into a single large file, the offsets of the small files within the large file are written into its header, and the result is stored on a data node as one file block. A small-file mapping table is added on the data-node side, extending the native Inode structure; small-file contents are extracted on the data node using the small-file index information; and a cache prefetching strategy is employed to improve read performance.

Finally, a test plan was designed to evaluate the extended system's memory footprint and read/write performance. Compared with the original HDFS small-file storage scheme, the extended system reduced memory usage by up to 70%, shortened average file-write time by 20%, and, with the prefetching strategy, shortened average file-read time by 40%.
【Degree-granting institution】: Huazhong University of Science and Technology
【Degree level】: Master's
【Year conferred】: 2013
【Classification number】: TP333
Article ID: 2213203
Link: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/2213203.html