
Research on Optimization of Data Deduplication for Archival Storage

Published: 2018-06-05 10:01

  Topic: data deduplication + distributed storage; Source: Master's thesis, Huazhong University of Science and Technology, 2013


[Abstract]: As society becomes increasingly information-driven, data grows ever more important, and the storage demands of enterprise data centers are exploding. Current storage systems are designed mainly around read/write performance and reliability, ignoring correlation and redundancy among data. This not only wastes storage space but also makes it difficult for users to manage large volumes of structurally complex data effectively. Data deduplication (de-duplication) has emerged in recent years to address this problem.

Based on an analysis of metadata access and query patterns, and of data layout and read/write characteristics in deduplication systems, this thesis proposes a deduplication architecture that separates metadata from data: (1) a three-party architecture consisting of clients, a metadata server, and storage nodes; (2) metadata access is confined to the path between clients and the metadata server, while file content access is confined to the path between clients and storage nodes, giving the scheme high scalability and high access concurrency. For the deduplication function itself: (1) data is divided into fixed-size chunks, with a hash algorithm such as MD5 or SHA-1 producing each chunk's fingerprint; (2) a two-layer Bloom filter quickly screens chunk fingerprints, and a B+-tree index serves as the persistent store for fingerprint metadata. To further optimize I/O performance: (1) a data layout policy that stores each data stream in its own region preserves spatial locality of access; (2) client-side metadata and data caching improves cache hit rates and file read/write performance.

Finally, a prototype deduplication system with the three-party architecture was designed and implemented, and functional and performance tests were run on it. Functional tests show that the scheme achieves a 130% data compression ratio on a virtual machine image test set; performance tests show that the caching mechanism improves file access performance; and fingerprint filtering statistics show that the two-layer Bloom filter achieves a high filtering rate, with an actual false-positive rate of 0.071%, within the bound allowed by the 0.1% theoretical false-positive rate.
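The fixed-size chunking and fingerprinting step described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the 8 KiB chunk size is an assumed value (the thesis does not state one here), and an in-memory dict stands in for the storage nodes.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # assumed fixed chunk size; the abstract does not specify one


def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and compute a SHA-1 fingerprint per chunk."""
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        yield hashlib.sha1(chunk).hexdigest(), chunk


def deduplicate(data: bytes, store: dict) -> list:
    """Store only chunks whose fingerprint is not yet known; return the file recipe.

    The recipe (ordered fingerprint list) is enough to reconstruct the file later.
    """
    recipe = []
    for fp, chunk in chunk_fingerprints(data):
        if fp not in store:
            store[fp] = chunk          # new unique chunk: keep its payload
        recipe.append(fp)              # duplicate chunks contribute only a reference
    return recipe
```

A file is restored by concatenating `store[fp]` for each fingerprint in its recipe; duplicate chunks cost only one stored copy plus per-reference metadata.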
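The two-layer Bloom filter screening can be sketched along these lines. The hash construction (double hashing to derive the k bit positions) and the idea of a smaller first layer backed by a larger second layer are assumptions for illustration; the abstract does not describe how the two layers are built or sized.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over a bit array of m bits with k hash positions."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k positions from two 64-bit slices of SHA-1(item).
        h = hashlib.sha1(item).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1   # force odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def maybe_duplicate(fp: bytes, layer1: BloomFilter, layer2: BloomFilter) -> bool:
    """Only fingerprints that pass BOTH layers go on to the persistent index lookup."""
    return fp in layer1 and fp in layer2
```

A "no" from either layer is definitive (the chunk is certainly new and can be stored immediately); only a "yes" from both layers triggers the slower persistent-index lookup, which is what makes the filtering rate matter.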
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2013
[CLC classification]: TP333






Link to this article: https://www.wllwen.com/kejilunwen/jisuanjikexuelunwen/1981591.html


