Sparse Indexing Techniques for File-Level Data Deduplication
Keywords: data deduplication; sparse indexing; virtual machine image files; file duplication locality. Source: National University of Defense Technology, 2012 master's thesis. Document type: degree thesis
【Abstract】: Data deduplication has been a research hotspot in the storage field in recent years. It is highly effective at improving storage utilization and reducing the bandwidth consumed by data transfers, and it is now widely used in data backup, archival storage, and remote disaster recovery systems. Most large-scale data centers hold large amounts of duplicate data, which wastes storage resources and energy. Cloud computing data centers in particular store large numbers of virtual machine image files, and these images contain a great deal of duplicate data. Removing it not only saves disk space but also reduces the bandwidth consumed when image files are transferred, speeding up access to and distribution of virtual machine images.
The disk access bottleneck that pervades data deduplication limits system performance. Existing approaches to this bottleneck include Data Domain's combination of Bloom filters, stream-informed segment layout (SISL), and locality-preserving caching, as well as Sparse Indexing and Extreme Binning. These solutions typically exploit the locality of data accesses to shrink the in-memory index and to reduce the number of disk I/Os per index lookup. However, they do not carry over to deduplication at the coarser granularity of whole files, and so they fail to remove the disk access bottleneck in that setting.
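As background for how these systems trade memory for disk I/O, the sketch below shows the kind of compact in-memory Bloom filter summary used in the Data Domain approach: most lookups for fingerprints that are not stored can be answered from memory alone, and only probable hits go to the on-disk index. This is a minimal illustration, not code from the thesis; the class name and parameters are assumptions.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: a compact in-memory summary of the on-disk
    fingerprint index. "Definitely not present" answers avoid disk I/O;
    false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive num_hashes bit positions from the fingerprint with salted SHA-1.
        for i in range(self.num_hashes):
            h = hashlib.sha1(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))
```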
To address the disk access bottleneck in file-level deduplication of the large numbers of virtual machine image files found in cloud computing, this thesis first proposes a deduplication method based on random sampling. Instead of holding the full file index, memory holds only a sparse index built from a random sample of the file fingerprints, which reduces the total number of in-memory index entries. During duplicate detection, the locality of duplication among virtual machine image files is exploited: the hits observed for the sampled file indexes are used to infer the hit status of the other files in the same directory, so that duplicate detection for each file in a directory no longer requires repeated disk accesses. Deduplication algorithms were designed and implemented for both the full-index case and the random-sampling case. The comparison shows that when the complete index table does not fit in memory, random sampling with a sparse index only 1/10 the size of the original still achieves a considerable deduplication ratio while greatly reducing disk accesses. This removes the sharp drop in detection performance that occurs when the full index cannot be kept in memory and improves the overall performance of the deduplication system.
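The following Python sketch illustrates the random-sampling sparse index and the locality-based inference described above. It is a minimal illustration under assumed names (SparseIndex, SAMPLE_RATE, and the injected load_directory_index helper), not the thesis's actual implementation.

```python
import hashlib
import random

SAMPLE_RATE = 0.1  # keep roughly 1/10 of all stored file fingerprints in memory


def file_fingerprint(data: bytes) -> str:
    """Whole-file fingerprint used as the dedup key (file-level granularity)."""
    return hashlib.sha1(data).hexdigest()


class SparseIndex:
    """In-memory index over a random sample of stored file fingerprints."""

    def __init__(self):
        self.samples = {}  # fingerprint -> directory id of an already-stored copy

    def build(self, stored_files):
        """stored_files: iterable of (fingerprint, directory_id) pairs."""
        for fp, dir_id in stored_files:
            if random.random() < SAMPLE_RATE:
                self.samples[fp] = dir_id

    def lookup_directory(self, incoming, load_directory_index):
        """incoming: fingerprints of one directory of a VM image.
        If any sampled fingerprint hits, duplication locality suggests the
        rest of the directory is probably stored too, so the matching
        directory's full index is read from disk once and reused for all
        files in the incoming directory."""
        hits = [self.samples[fp] for fp in incoming if fp in self.samples]
        if not hits:
            return set()                         # treat the whole directory as new data
        on_disk = load_directory_index(hits[0])  # a single disk read per directory
        return {fp for fp in incoming if fp in on_disk}
```

The key point the sketch tries to capture is that a directory either misses entirely in the sparse index (and is treated as new data) or triggers a single disk read that resolves duplicate status for all of its files at once.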
To address the index capacity bottleneck in file-level duplicate detection and to raise the deduplication ratio achievable with a sparse index, the thesis further proposes a deduplication scheme based on directory groups. The directory tree of a virtual machine image is first partitioned into directory groups of roughly equal size; sample file indexes are then chosen from each group, either by random sampling or by a sampling method based on Broder's theory, and these samples form the in-memory sparse index on which the deduplication scheme is built. The scheme was implemented, and the experimental data confirm the locality of duplication among virtual machine image files. Comparing the deduplication results under different sampling factors shows that with an in-memory index only 1/10 the size of the original, the directory-group sparse index, by exploiting the inherent duplication locality of virtual machine images, achieves a deduplication ratio above 96%, reducing the number of in-memory index entries and effectively avoiding the disk access bottleneck. The experiments also compare the deduplication ratios obtained with random sampling versus Broder-based sampling within directory groups, analyze how the group partition size affects the deduplication ratio, and finally compare the directory-group sparse index against the purely random-sampling sparse index.
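A rough sketch of the directory-group sampling is given below; GROUP_SIZE, SAMPLES_PER_GROUP, and the helper names are assumptions made for illustration. The min-wise selection follows Broder's result that sets with high Jaccard similarity tend to share their smallest hash values, so directory groups that repeat across images tend to contribute the same sample fingerprints to the sparse index.

```python
import hashlib
import os

GROUP_SIZE = 2_000          # approximate number of files per directory group (assumed)
SAMPLES_PER_GROUP = 200     # roughly 1/10 of a group's file fingerprints (assumed)


def partition_into_groups(image_root: str):
    """Walk a VM image's directory tree and cut it into groups of roughly
    GROUP_SIZE files each (a simple size-based split)."""
    group, groups = [], []
    for dirpath, _dirnames, filenames in os.walk(image_root):
        for name in filenames:
            group.append(os.path.join(dirpath, name))
            if len(group) >= GROUP_SIZE:
                groups.append(group)
                group = []
    if group:
        groups.append(group)
    return groups


def broder_sample(fingerprints, k=SAMPLES_PER_GROUP):
    """Min-wise sampling: keep the k fingerprints with the smallest hash
    values. Groups with many files in common will pick many of the same
    samples, which is what lets the sparse index route an incoming group
    to the right stored group."""
    keyed = sorted(fingerprints, key=lambda fp: hashlib.md5(fp.encode()).digest())
    return keyed[:k]
```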
Finally, to overcome the limited scalability of centralized deduplication systems, the thesis proposes a deduplication scheme for distributed environments. The scheme parallelizes the deduplication process across nodes, stores data in a distributed manner, and uses a simple routing algorithm so that the data nodes remain independent and autonomous. A simple, practical data migration strategy is also proposed, and the characteristics, feasibility, and impact on overall system performance of the distributed scheme are analyzed. By avoiding the negative effects of inter-node communication and keeping the data nodes autonomous, the scheme achieves distributed, parallel deduplication.
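As an illustration of how a simple routing rule can keep data nodes autonomous, the sketch below routes each directory group by hashing a representative fingerprint, so any node can decide locally where a group belongs without inter-node communication. The node count and function name are assumptions for illustration, not the thesis's design.

```python
import hashlib

NUM_NODES = 4  # assumed number of data nodes


def route_group(representative_fingerprint: str, num_nodes: int = NUM_NODES) -> int:
    """Map a directory group to a data node purely from its representative
    fingerprint; no shared state or inter-node lookup is required."""
    digest = hashlib.sha1(representative_fingerprint.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


# Example: route the group whose min-hash representative is "ab12..." to a node.
node_id = route_group("ab12cd34ef56")
```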
【Degree-granting institution】: National University of Defense Technology
【Degree level】: Master's
【Year of degree conferral】: 2012
【CLC classification number】: TP333